philres / ngmlr

NGMLR is a long-read mapper designed to align PacBio or Oxford Nanopore (standard and ultra-long) to a reference genome with a focus on reads that span structural variations
MIT License
293 stars 40 forks source link

RG option doesn't work as described #43

Closed ghannum closed 6 years ago

ghannum commented 6 years ago

The help menu describes the use of --rg-id as "Adds RG:Z: to all alignments in SAM/BAM".

This is not the case - no information is added to the alignments themselves. This would be a useful option if it worked properly.

Also, the help menu appears to have this text for the description of '-o', which is a bug.

wdecoster commented 6 years ago

I see RG:Z:foo in a toy example I just ran with ngmlr 0.2.7 (installed from bioconda). I confirm that the help text contains the described bug.

ghannum commented 6 years ago

I get help text in the header, but not for each read individually. Did you get both?

philres commented 6 years ago

Hi Ghannum,

could you please send me the command you used to run ngmlr?

Thanks, Philipp

ghannum commented 6 years ago

ngmlr 0.2.7 (build: Jun 25 2018 10:17:59, start: 2018-07-17.15:49:31)

ngmlr -r ref.fasta -q reads.fastq.gz --rg-id SAMPLEID --rg-sm SAMPLEID -t 32 -x pacbio | samtools view -bS - | samtools sort -@ 32 - -o SAMPLEID.pacbio.bam

I get the same issue without all of the samtools piping...

philres commented 6 years ago

Thanks! Using this parameters, ngmlr 0.2.7 gives me the following:

@HD     VN:1.0  SO:unsorted
@SQ     SN:NC_001416.1  LN:48502
@PG     ID:ngmlr        PN:nextgenmap-lr        VN:0.2.7        CL:ngmlr -r ref.fa -q read.fastq --rg-id SAMPLEID --rg-sm SAMPLEID -t 1 -x pacbio
@RG     ID:SAMPLEID     SM:SAMPLEID
5a68e600-dcdd-4ef1-8a1f-0fa11dc7fffb    0       NC_001416.1     26539   60      32S20M1I1M1I40M3I5M1D47M1I59M2D8M1D1M1D13M2D12M2D34M1I28M1D10M2D12M1I31M2I46M4I63M1D6M1I16M1I22M1D9M2I21M1I9M1D5M2D23M2D24M1D2M1D21M3D54M1I44M2D10M1D37M1D1M1D10M1D11M1D75M2I3M1D39M3D15M1D31M2D21M1D21M1I14M1I4M1D23M1D6M1D42M20S      *       0       0       TTGTTATTGCTTCGTTCAGTTACAGTATTGCTCTGTCCATGAATTTCTGAAAGGAAGTTACCCCTCTAAGTAATGAGGTGGTAAGGACGCTTTCATTTTTTTCGTGTCGGCTAATCGATTTGGCCATACTACTAAATCCTGAATAGCTTTAAAGAAGGTTATGTTTAAAACCATCGCTTAATTTGCTGAGATTAACATAGTAGTCAATGCTCACCTAAGAAAAAACATTTCAGAGTTGACTGAATTTTATCTATTAATGAATAAGTGCTTACTTCTTCTTTTTTGACCTACAAAACCAATTTTAACATTCCGATATCATTTTTCACCGTAACTCATCGAAGACAGTAAGATAAAACATTATGGCACAAAGGAATAGTCATTTCAACCATCTGCTCGTAGGAATGCCTTATTTTTTTTTTTCTACTGCAGGAATATACCCGCCTCTTTCAATAACACTAAACTCCAACATATAGTAACCTTAATTTTTATTAAAATAACCACCAATTTATTTGGCGGCAACACGGATCTCTCTTTTTTAAGTTACTCTCTATTACGCCACGTTTTCATCTAAAGTGGTAGTATTGAACTTAACGGAATCGTATTGTAGTTTTCCATATTGCTTTCTGCTTCCTTTTGGATGTACTGTTATTCATGTTGCATGGTGCACTGTTTATACCAACGATATAGTCTATTAAATGCATATATGGTATCGCCGAACGATTAGCTCTTCAGGCTTCTAGAAGCGTTTAAGTACTAATAAGCCGATAGATAGCCACGGACTTCATGTATTTTTCATAGTCTTAACTTCGCTCCTCGCTCATAACAGACATTCACTACAGTTATGGCGGAAAGGTATGCATGCTGGGTGTGGGGAAGTCGTGAGAAAGAAAGAAGTCAACTGCGTCGTTTGACATCACCGCTATCTTACTGGTTATGCAGGTGTAGTGGGTAGCACACAAAGCTTTGCACTGGTGCGAGGCTTTGTGCTTCTCTGAGTGCGACAGGTTTGATGACAAAAAAATTAGCGCAGAGGAACAAAAATCACCTTGCGCTAATGTAAATTTGGGTCACTAATACCATCTAAGTAGTTGATTCATAGTGACTGGATGGCAATACGTAACTACC   &$+*+(5'%(@+=669>375A,0*$+&*+&).=??>07//(1+7G//@393(-,*,+/H22FG8-934>4-;-,-527-./0.71>30-+AA/.=B;893-**$--774<<8;DDA>@0E5-45?9EBFB<AAAG>=:630),&,&*/874@++2633.3.449G?;FJDCBAAE1.</9-A03+.0C22272491A@:7797:D;G/,,,'(40)'/*2>EEF712?J8:0*(/8>6?A9751:0-83??C9:0;104HIFEA<861=112;GAAFAE<IHJ,/0<>98-97D><CC8AFF:ACD:C2/+()/,/299@HH;1;>@*7(%*-@=A7&.*&,)&*,7B>A?AGG5/-7.%++.+&)>F7273>=<CEC7?D-90C/;2/062-00AB?FE?E?@F4F00GBCBHEDF@A-80',)*4464:64A<79C739A<0DF0,@/.AF>BCDFGD653641037;?A40;0/(''60;FJF3;4H7JHD/8558.636.51.'8B):*,:3/9.1-0.4166656*;6;>@3510/<00248EDDBE25''&(,26@F753:@==>4*2'),007%1&&+;40D2F22--()+::13../019IJ83,1B5=?1.2D:251-6A9=8IH<@G96%*'+*)11&0255E9AE:?DD@5724(.00@G635=:C<F45549I7?B<>;?<07)-,**/15CA*=<;?EBCEBBF?860A)./ADE96&4=6;<5+''2+))0<A031)*.328739B3@/-/-,'/05*-4)..7>>=A.%,*184IJJG3;;150%++)(+013/48=9.6/71<-05=6&+,?2>>>=@A<D(3171/C11@B9C05<=??B?AA?BBAF;4+6'-2*1,-456<)*&2<3)-85',0-/-4(.//2467:@F@7;6974/3,-.9621.7)22>7G689&A+<58*-.%6.3..'58(9;==A3.22:;,0.74*'9,.3-4714;30>56>@>>=0*)+,8;DDF)?/=>.*1-,,-4=?FEA820/.019://1-5-/>GII5/34ED<E78547210.25<,51(.5+++5766C;:=DB@@>?C9<:<>B;9>?:23/.-.'(%'+(/-602A:9;><>+0)('%   RG:Z:SAMPLEID   AS:i:1568       NM:i:95 XI:f:0.9148     XS:i:0  XE:i:1568       XR:i:1073       MD:Z:46T19^A0A105^TT8^G1^A13^GG12^TT62^T10^GC10A1G6A21G1A17C90^C22G21^A30A0T7^C5^AA3T1A17^GG2C21^T2^A21^AAC0C0C63A32^GA10^C35G1^A1^C0C9^A3G7^C78^A10G19T8^CTT15^C9G21^AT21^G37A1^G23^C1C0T0G2^A0C0A40   SV:i:2  QS:i:32 QE:i:1105       CV:f:95.377777

The header contains the RG-ID and the additional information (which in this case is only RG-SM) and the reads have the RG tag with the RG-ID referencing the information in the header. This is what the SAM specification suggest and as far as I know what other read aligners do as well.

Are you not getting the RG:Z tags in the alignments?

Thanks, Philipp

fritzsedlazeck commented 6 years ago

Just to clarify @ghannum do you see any RG:Z: Tag in the entries of the reads? You can check with samtools view your.bam | less

Thanks Fritz

ghannum commented 6 years ago

Correct - I am not getting any RG:Z tags in the alignments.

My header has: @RG ID:SAMPLEID SM:SAMPLEID

But my reads look like this: m54114_180515_083632/6684916/6689_6758 16 ref_chr_4 2040455 0 39S3M1I19M7S * 0 0 ACCACACAGATCTGGGAGATTCTCTGCAAAAAACAACAACGGAGGAGAGGAAAAGCGGGGGGAAGAGTC !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! AS:i:32 NM:i:2 XI:f:0.913 XS:i:0 XE:i:32 XR:i:23 MD:Z:7G14 SV:i:0 SA:Z:ref_chr_1,4831832,-,41S19M1D3M6S,0,2; QS:i:39 QE:i:62 CV:f:33.333332

philres commented 6 years ago

Could you please try to either pull the newest commits from github and rebuilding from source or using ngmlr 0.2.7 installed from conda and confirm you still see the same problem.

Unfortunately, I can't reproduce it myself.

ghannum commented 6 years ago

I'm running from the binary compiled as part of the 0.2.7 release: https://github.com/philres/ngmlr/releases/download/v0.2.7/ngmlr-0.2.7-linux-x86_64.tar.gz

According to git, the only changes from here are in the README.

Another issue you might look at (and might be related) is in the help text. Here is what I see at the top... notice the wrong description for the '-o' option.

ngmlr --help ngmlr 0.2.7 (build: Jun 25 2018 10:17:59, start: 2018-07-17.10:06:00) Contact: philipp.rescheneder@univie.ac.at

Usage: ngmlr [options] -r -q [-o ]

Input/Output: -r , --reference (required) Path to the reference genome (FASTA/Q, can be gzipped) -q , --query Path to the read file (FASTA/Q) [/dev/stdin] -o , --output Adds RG:Z: to all alignments in SAM/BAM [none] --skip-write Don't write reference index to disk [false] ...

philres commented 6 years ago

I tried it again with the version from here (https://github.com/philres/ngmlr/releases/download/v0.2.7/ngmlr-0.2.7-linux-x86_64.tar.gz) and still get RG:Z tags for all reads.

Would it be possible to send me a read file with a couple of reads that cause the problem + the reference or information on what your reference genome is? I would be very surprised if this differed between datasets, but it is the last thing I can think of.

Thanks for pointing out the error in the help text. I just corrected it in the source code. However, this is unrelated to the behavior you are seeing.

ghannum commented 6 years ago

The data and reference are proprietary - and I'm not sure if I can anonymize it without compromising the test. Do you have a public ref & small fastq I could try to match your end?

fritzsedlazeck commented 6 years ago

here is a sample that I just generated and tested. I reinstalled ngmlr over conda and run it like this: ngmlr -r ref.fa -q tmp.fa --rg-id SAMPLEID --rg-sm SAMPLEID -o tmp.sam

The tmp.sam included the header and entry for the read groups per read.

Please let us know if that works for your installation. Thx Fritz ref.fa.gz tmp.fa.gz

ghannum commented 6 years ago

aha! Your data and command line produce the expected output. After playing with it a bit, I found the way to reproduce the bug is with the multithreaded '-t 32' option. When it is off I get the RG code, when it is on I do not.

fritzsedlazeck commented 6 years ago

Ok thanks. We will look into this. Cheers Frits

philres commented 6 years ago

Ok this was a tricky one. Thanks a lot for reporting this. I think I found the problem and it works for me now. I have attached a binary to this comment. Could you please try running the binary with your data and verify that it works now as expected?

Thanks, Philipp

ngmlr-0.2.8-dev.tar.gz

ghannum commented 6 years ago

Works great for me. Thanks for your help!

wdecoster commented 6 years ago

Hi Philipp,

Are you planning on making a new release with these changes?
I'll take care of the bioconda recipe then.

Wouter

williamrowell commented 4 years ago

Was there ever a new release with this fix? The most recent tag is 0.2.7 and predates this bug fix.

wdecoster commented 4 years ago

Not that I am aware of.

fritzsedlazeck commented 4 years ago

yeah sorry Philipp was the one that developed this and I currently dont have the resources.. Hopefully in the near future.