piosierra / pantera

Identification of transposable element families from pangenome polymorphisms
MIT License

GFA files created by vg #2

Closed zhengluo-lz closed 3 months ago

zhengluo-lz commented 3 months ago

Hi Pio,

Could you supply a small set of test files created by vg for pantera? I get an error when using the GFA format created by vg as input.

piosierra commented 3 months ago

Hello, can you run head -n 1000 yourfile.gfa and send me the result? Also, can you share the exact error you get? Thanks.

zhengluo-lz commented 3 months ago

test.zip

Sure, this is the GFA file created by vg. When I run pantera.R -g test.gfa.1 -o test_output, I get the error below:

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  line 5330 did not have 3 elements
Calls: get_segments -> read.table -> scan
Execution halted
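
For readers hitting the same message: read.table stops when a row has a different number of fields than it expects. The following is a minimal Python sketch for locating such rows, not pantera's own code. It assumes the segment (S) records are tab-separated and that three fields are expected; the function name is hypothetical. It reports line numbers within the whole file, which may differ from the line number R prints if R only sees a filtered subset of lines.

```python
from typing import Iterable, List, Tuple


def find_irregular_segment_lines(
    lines: Iterable[str], expected_fields: int = 3
) -> List[Tuple[int, int]]:
    """Return (line_number, field_count) for S lines with an unexpected field count."""
    irregular = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "S":
            continue  # only segment records are checked here
        if len(fields) != expected_fields:
            irregular.append((number, len(fields)))
    return irregular
```

To check a real file, pass an open file handle, for example find_irregular_segment_lines(open("test.gfa")), and inspect the reported lines by hand.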

piosierra commented 3 months ago

Thanks, I uploaded a new version that fixes that bug. Note, however, that pantera will not run on the test file you sent anyway, because the polymorphic segments are too small. You can reduce the minimum size (-s).

piosierra commented 3 months ago

Implemented a temporary fix for the fread bug. Added a check for the minimum number of polymorphic segments.

zhengluo-lz commented 3 months ago

Thank you, I will try it again.

zhengluo-lz commented 3 months ago

Why do I still get the same error when I run the new script?

piosierra commented 3 months ago

I can confirm the fix works in Linux. What system are you using?

zhengluo-lz commented 3 months ago

I use a Linux system. Are you sure you can run the new script on the GFA file I provided?

piosierra commented 3 months ago

Yes, that is why I uploaded the fix. Do you get exactly the same error? Can you confirm you are running the new version, just in case?

zhengluo-lz commented 3 months ago

Yes, I downloaded the script you just uploaded but still get the same error.

piosierra commented 3 months ago

If you run echo -e '@@A\tB\tC\tD\tE\tF';head test.gfa, is this what you get?

@@A B C D E F
H VN:Z:1.1
S 1 TAAACCCTAAACCCTAAACCCTAAACCCTAAA
S 2 CCCTAAACCCTAAACCCTAAACCCTAAACCCT
S 3 AAAACCCTAAACCCTAAACCCTAAAACCCTAA
S 4 ACCCTAAACCCTAAACCCTAAACCCTAAACCC
S 5 TAAACCCTAAACCCTAAACCCTAAACCCTAAA
S 6 CCCTAAACCCTAAACCCTAAACCCTAAACCCT
S 7 AAACCCTAAACCCTAAACCCTAAACCCTAAAC
S 8 CCTAAACCCTAAACCCTAAAACCCTAAACCCT
S 9 AAACCCTAAACCCTAAACCCTAAACCCTAAAC

zhengluo-lz commented 3 months ago

Yes, it's the same as the one you provided.

piosierra commented 3 months ago

In case it is related: in your comment you say you ran test.gfa.1, but the file you sent me is test.gfa. Are you sure we are talking about the same file? I can confirm that the one you shared does not return an error with this version of pantera running on Linux.

zhengluo-lz commented 3 months ago

Oh, I see, I used the wrong file. But now there's a new error.

Error in nchar(seq) :
  cannot coerce type 'closure' to vector of type 'character'
Calls: [ -> [.data.table -> eval -> eval -> nchar -> nchar
In addition: Warning messages:
1: File '/tmp/RtmpBChtKw/file73a4d663968e6' has size 0. Returning a NULL data.table.
2: File '/tmp/RtmpBChtKw/file73a4d64c3ae24' has size 0. Returning a NULL data.table.
Execution halted

piosierra commented 3 months ago

Please share the new GFA file, if it is not too large, and the pantera.log from that run.

zhengluo-lz commented 3 months ago

test.file.zip

This is the new GFA file. Thanks.

piosierra commented 3 months ago

Also, which options did you use? As I mentioned, I don't think that GFA is a good representation of a variation-graph pangenome, as all segments have the same size (32). Pantera will not work correctly on that GFA. Can you share how it was generated?
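
The observation that every segment has the same size can be checked with a quick tally of segment lengths. This is a minimal Python sketch, not part of pantera. It assumes tab-separated S records with the sequence in the third field and ignores records that store the sequence as "*". Both function names are hypothetical, and the min_size parameter only loosely mirrors the idea of a minimum segment size; it is not pantera's -s logic.

```python
from collections import Counter
from typing import Dict, Iterable


def segment_length_counts(lines: Iterable[str]) -> Dict[int, int]:
    """Map each segment length to the number of S records with that length."""
    counts: Counter = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[0] == "S" and fields[2] != "*":
            counts[len(fields[2])] += 1
    return dict(counts)


def segments_at_least(length_counts: Dict[int, int], min_size: int) -> int:
    """Count segments whose length is at least min_size (hypothetical threshold)."""
    return sum(n for length, n in length_counts.items() if length >= min_size)
```

If the tally has a single key, such as 32, the graph has been cut into fixed-size nodes, and no segment will pass a minimum size larger than that value.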

piosierra commented 3 months ago

Maybe it is finding some temporary files left in the folder by an aborted run. Can you confirm that by trying a different output folder? If that is the problem, I will add a check to confirm that the output folder does not already exist.
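
The kind of check described here can be sketched as follows. This is an illustrative Python version under stated assumptions, not the check that was added to pantera (which is an R script); the function name is hypothetical.

```python
from pathlib import Path


def require_new_output_dir(path: str) -> Path:
    """Create the output folder, refusing to reuse one that already exists."""
    out = Path(path)
    if out.exists():
        # Leftover files from an aborted run could be picked up by a later run.
        raise FileExistsError(f"Output folder already exists: {out}")
    out.mkdir(parents=True)
    return out
```

Failing early like this avoids reading empty or partial intermediate files left behind by a previous run.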

zhengluo-lz commented 3 months ago

source.file.zip

Here are the VCF and FASTA files I used. I generated the GFA file using the following commands.

vg autoindex --prefix test --workflow giraffe --ref-fasta test.fa --vcf test.vcf.gz
vg convert --gfa-out --gbwtgraph-algorithm --no-wline test.giraffe.gbz > test.gfa
vg convert -fW --gfa-in test.gfa > test.new.gfa

test.gfa and test.new.gfa are different versions of the GFA file.

zhengluo-lz commented 3 months ago

After changing the output folder name, there were no errors, but there were also no results. Is this because the GFA format created by vg has issues?

piosierra commented 3 months ago

Thanks. I will upload a fix that requires the output folder not to exist, to avoid these issues. Regarding the vg file: I was reading about how it is built, and it seems to me that this GFA is mostly used as a way to pass information between tools, but it has not been prepared to collapse the paths into common segments. I would suggest you use the output of either pggb or minigraph directly.

zhengluo-lz commented 3 months ago

Sure, thank you. This is excellent software for identifying transposons, especially for those working in maize genomics. I will recommend it to more researchers working on maize genomics and transposon-related studies.

piosierra commented 3 months ago

Thank you!

piosierra commented 3 months ago

Uploaded a fix that requires the output folder not to exist already, to prevent errors after failed runs.