vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.13k stars 195 forks source link

Explanation of BAM obtained from GAM #2263

Open ziaursani opened 5 years ago

ziaursani commented 5 years ago

Hi

I have attached three BAM files. All the three files are converted from GAM files. In all the files one but different read is mapped against reference. When I look at these files through samtools tview. One file shows complete read, second file shows part of the read and third file shows no read at all. Can anyone please explain why is this?

NCYC430_8103_p2_Reads.zip

glennhickey commented 5 years ago

It may be a bug in vg surject. Are you able to share the input and command line to reproduce?

ziaursani commented 5 years ago

I have attached the input consisting of three reads and a reference. Following are the commands that produced one of the bam files. Please help.

vg construct -C -r S288c_rDNA_cracker_fixed_241.fasta > S288c_rDNA_cracker_fixed_241.vg vg index -x S288c_rDNA_cracker_fixed_241.xg -g S288c_rDNA_cracker_fixed_241.gcsa S288c_rDNA_cracker_fixed_241.vg vg map -f NCYC430_8103_p2_Read1000.fa -x S288c_rDNA_cracker_fixed_241.xg -g S288c_rDNA_cracker_fixed_241.gcsa > NCYC430_1000_p2_Read1000.gam vg surject -x S288c_rDNA_cracker_fixed_241.xg -b NCYC430_8103_p2_Read1000.gam > NCYC430_8103_p2_Read1000.bam

Reads&Reference.zip

glennhickey commented 5 years ago

Since your reads are in FASTA and not FASTQ, you need to use vg map -F and not vg map -f. And luckily the carriage returns (\r's) in your fasta files are supported by -F since #2129.

ziaursani commented 5 years ago

Thanks Glenn, Your answer raises another question. These fasta files I converted from fastq files. This would mean fastq files may also have the same carriage returns (\r's). Earlier I have done lot of work with fastq files. Actually I got this problem from there. I just minimized the problem to only one read to show you. I have used vg map -f with those fastq files. Do you think this carriage return is causing problems? Please help.

glennhickey commented 5 years ago

Carriage returns in the fastq will probably cause problems. Best to strip them out before passing to vg map.

ziaursani commented 5 years ago

Hi Glenn, I am again facing same above problem. I have attached three bam files of one, two and four reads. All the reads are unique but are mapped against the same reference. These bam files are converted from GAM file. The samtools tview command doesn't show any read for one read bam file. It shows part of the reads for two read bam file and it shows a complete mapped reads for four read bam file. Following are the commands I have used to generate one of these files. vg construct -C -r S288c_rDNA_cracker_fixed_261.fasta > S288c_rDNA_cracker_fixed_261.vg vg index -x S288c_rDNA_cracker_fixed_261.xg -g S288c_rDNA_cracker_fixed_261.gcsa S288c_rDNA_cracker_fixed_261.vg vg map -F NCYC430_8103_p22_fwd_1Read.fa -x S288c_rDNA_cracker_fixed_261.xg -g S288c_rDNA_cracker_fixed_261.gcsa > NCYC430_8103_p22_fwd_1Read+s288c_261.gam vg surject -x S288c_rDNA_cracker_fixed_261.xg -b NCYC430_8103_p22_fwd_1Read+s288c_261.gam > NCYC430_8103_p22_fwd_1Read+s288c_261.bam I am also attaching reference and read files. Please help.

query2263.zip

glennhickey commented 5 years ago

Interesting. I can reproduce problems on the 4-read case. Running

 vg map -F NCYC430_8103_p22_fwd_4Reads.fa -x S288c_rDNA_cracker_fixed_261.xg -g S288c_rDNA_cracker_fixed_261.gcsa  | vg view -a -

repeatedly gives different outputs. Here are two (from running that same above command)

vg map -F NCYC430_8103_p22_fwd_4Reads.fa -x S288c_rDNA_cracker_fixed_261.xg -g S288c_rDNA_cracker_fixed_261.gcsa  | vg view -a -
{"sequence":"TGGGAGCTTCGGCGCCAGTGAAATACTACTACCTTTATAGTTTCTTTACTTATTCAATGAAGCGGAGCTGGAATTCATTTTCCACGTTCTAGCATTCAAGGTCCCATTCGGGGCTGATCCGGGTT","path":{"mapping":[{"position":{"node_id":"4","offset":"11"},"edit":[{"from_length":21,"to_length":21}],"rank":"1"},{"position":{"node_id":"5"},"edit":[{"from_length":5,"to_length":5},{"from_length":1,"to_length":1,"sequence":"T"},{"from_length":26,"to_length":26}],"rank":"2"},{"position":{"node_id":"6"},"edit":[{"from_length":32,"to_length":32}],"rank":"3"},{"position":{"node_id":"7"},"edit":[{"from_length":32,"to_length":32}],"rank":"4"},{"position":{"node_id":"8"},"edit":[{"from_length":8,"to_length":8}],"rank":"5"}]},"name":"read677","mapping_quality":60,"score":130,"identity":0.992,"refpos":[{"offset":"107","name":"chrXII:450575..469931"}],"time_used":371}
{"sequence":"NNNANNNNNGCCAGTGAAATACTACTACCTTTATAGTTTCTTTACTTATTCAATGAAGCGGAGCTGGAATTCATTTTCCACGTTCTAGCATTCAAGGTCCCATTCGGGGCTGATCCGGGTTGAAG","path":{"mapping":[{"position":{"node_id":"4","offset":"15"},"edit":[{"from_length":1,"to_length":1,"sequence":"N"},{"from_length":1,"to_length":1,"sequence":"N"},{"from_length":1,"to_length":1,"sequence":"N"},{"from_length":1,"to_length":1,"sequence":"A"},{"from_length":1,"to_length":1,"sequence":"N"},{"from_length":1,"to_length":1,"sequence":"N"},{"from_length":1,"to_length":1,"sequence":"N"},{"from_length":1,"to_length":1,"sequence":"N"},{"from_length":1,"to_length":1,"sequence":"N"},{"from_length":8,"to_length":8}],"rank":"1"},{"position":{"node_id":"5"},"edit":[{"from_length":5,"to_length":5},{"from_length":1,"to_length":1,"sequence":"T"},{"from_length":26,"to_length":26}],"rank":"2"},{"position":{"node_id":"6"},"edit":[{"from_length":32,"to_length":32}],"rank":"3"},{"position":{"node_id":"7"},"edit":[{"from_length":32,"to_length":32}],"rank":"4"},{"position":{"node_id":"8"},"edit":[{"from_length":12,"to_length":12}],"rank":"5"}]},"name":"read1041","mapping_quality":60,"score":117,"identity":0.92,"refpos":[{"offset":"111","name":"chrXII:450575..469931"}],"time_used":378}
{"sequence":"ACATTGTCANNNANNNNNTACTACCTTTATAGTTTCTTTACTTATTCAATGAAGCGGAGCTGGAATTCATTTTCCACGTTCTAGCATTCAAGGTCCCATTCGGGGCTG","path":{"mapping":[{"position":{"node_id":"5","offset":"6"},"edit":[{"to_length":19,"sequence":"ACATTGTCANNNANNNNNT"},{"from_length":26,"to_length":26}],"rank":"1"},{"position":{"node_id":"6"},"edit":[{"from_length":32,"to_length":32}],"rank":"2"},{"position":{"node_id":"7"},"edit":[{"from_length":31,"to_length":31}],"rank":"3"}]},"name":"read1201","mapping_quality":60,"score":94,"identity":0.82407407407407407,"refpos":[{"offset":"134","name":"chrXII:450575..469931"}],"time_used":364}
{"sequence":"GCCAGTGAAATACTACTACCTTTATAGTTTCTTTACTTATTCAATGAAGCGGAGCTGGAATTCATTTTCCACGTTCTAGCATTCAAGGTCCC","path":{"mapping":[{"position":{"node_id":"4","offset":"24"},"edit":[{"from_length":8,"to_length":8}],"rank":"1"},{"position":{"node_id":"5"},"edit":[{"from_length":5,"to_length":5},{"from_length":1,"to_length":1,"sequence":"T"},{"from_length":26,"to_length":26}],"rank":"2"},{"position":{"node_id":"6"},"edit":[{"from_length":32,"to_length":32}],"rank":"3"},{"position":{"node_id":"7"},"edit":[{"from_length":20,"to_length":20}],"rank":"4"}]},"name":"read1640","mapping_quality":60,"score":97,"identity":0.98913043478260865,"refpos":[{"offset":"120","name":"chrXII:450575..469931"}],"time_used":332}

vg map -F NCYC430_8103_p22_fwd_4Reads.fa -x S288c_rDNA_cracker_fixed_261.xg -g S288c_rDNA_cracker_fixed_261.gcsa  | vg view -a -
{"sequence":"TGTGAAGAGACATAGAGGGTGTAGAATAAGTGGGAGCTTCGGCGCCAGTGAAATACAACTACCTTTATAGTTTCTTTACTTATTCAATGAAGCGGAGCTGGAATTCATTTTCCACGTTCTAGCAT","path":{"mapping":[{"position":{"node_id":"3","offset":"13"},"edit":[{"from_length":19,"to_length":19}],"rank":"1"},{"position":{"node_id":"4"},"edit":[{"from_length":32,"to_length":32}],"rank":"2"},{"position":{"node_id":"5"},"edit":[{"from_length":5,"to_length":5},{"from_length":1,"to_length":1,"sequence":"A"},{"from_length":26,"to_length":26}],"rank":"3"},{"position":{"node_id":"6"},"edit":[{"from_length":32,"to_length":32}],"rank":"4"},{"position":{"node_id":"7"},"edit":[{"from_length":10,"to_length":10}],"rank":"5"}]},"name":"read677","mapping_quality":60,"score":130,"identity":0.992,"refpos":[{"offset":"77","name":"chrXII:450575..469931"}],"time_used":311}
{"sequence":"TACTACCTTTATAGTTTCTTTACTTATTCAATGAAGCGGAGCTGGAATTCATTTTCCACGTTCTAGCATTCAAGGTCCCATTCGGGGCTG","path":{"mapping":[{"position":{"node_id":"5","offset":"5"},"edit":[{"from_length":1,"to_length":1,"sequence":"T"},{"from_length":26,"to_length":26}],"rank":"1"},{"position":{"node_id":"6"},"edit":[{"from_length":32,"to_length":32}],"rank":"2"},{"position":{"node_id":"7"},"edit":[{"from_length":31,"to_length":31}],"rank":"3"}]},"name":"read1201","mapping_quality":60,"score":95,"identity":0.98888888888888893,"refpos":[{"offset":"133","name":"chrXII:450575..469931"}],"time_used":299}

The issue seems to go away when running single-threaded (with -t 1), where it outputs something correct-looking every time. There seems to be something wrong with multithreaded i/o here, maybe specific to fasta input. @adamnovak do you think this is related to your recent work?

adamnovak commented 5 years ago

The problem here is that reading sequence from a FASTA via FastaHack's FastaReference is not thread-safe. The object has one open file handle and just seeks it around to read whatever sequence you asked for:

https://github.com/vgteam/fastahack/blob/8e886558959b4283dde8d31144f277716157884a/Fasta.cpp#L274-L294

But we just read from it on all threads in a parallel for loop:

https://github.com/vgteam/vg/blob/e544284887891ca62846ff786b9446a8fbd7e5c9/src/subcommand/map_main.cpp#L944-L949

We need to open one FastaReference per thread if we want to map from FASTA on all threads. Otherwise we'll get cross-talk between the different reads in progress, and chimeric sequences composed of bits of different ones of them.

ekg commented 5 years ago

Ouch. Might be best to just load the FASTA file into memory and work on that. Then we can handle gzipped files as well.

On Tue, Jun 11, 2019, 21:30 Adam Novak notifications@github.com wrote:

The problem here is that reading sequence from a FASTA via FastaHack's FastaReference is not thread-safe. The object has one open file handle and just seeks it around to read whatever sequence you asked for:

https://github.com/vgteam/fastahack/blob/8e886558959b4283dde8d31144f277716157884a/Fasta.cpp#L274-L294

But we just read from it on all threads in a parallel for loop:

https://github.com/vgteam/vg/blob/e544284887891ca62846ff786b9446a8fbd7e5c9/src/subcommand/map_main.cpp#L944-L949

We need to open one FastaReference per thread if we want to map from FASTA on all threads. Otherwise we'll get cross-talk between the different reads in progress, and chimeric sequences composed of bits of different ones of them.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/2263?email_source=notifications&email_token=AABDQEJMPJT3NXUE6AJCNKDPZ74M5A5CNFSM4HMLHWQKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODXOH2SQ#issuecomment-500989258, or mute the thread https://github.com/notifications/unsubscribe-auth/AABDQEJESIKXQYMGATH6HNTPZ74M5ANCNFSM4HMLHWQA .

adamnovak commented 5 years ago

Maybe really what we want is something like for_each_parallel: stream the FASTA through and dispatch work items to be handled in threads.

@jeizenga You wrote all that fancy batching logic right? Is there an easy way to adapt it so the batching logic can be shared while FASTA parsing is swapped in at the front end?

jeizenga commented 5 years ago

Nah, @ekg wrote it originally, I just expanded on it a bit. I think it could probably done though. The sequence would need to get packed into a Protobuf object, but the Alignment should be able to handle it. Creating the Protobuf object and executing on it is all lambda functions, so modifying things is pretty simple.

Here's a place you could look for a model: https://github.com/vgteam/vg/blob/master/src/alignment.cpp#L347