itmat / rum

RNA-Seq Unified Mapper
http://cbil.upenn.edu/RUM
MIT License
26 stars 4 forks source link

Malformed CIGAR string: 1SM102S #106

Open kdaily opened 12 years ago

kdaily commented 12 years ago

Using a SAM file produced from RUM 1.11, I'm getting a strange error; the same error happens with SortSam and ValidateSamFile. According to the SAM specifications, this CIGAR string shouldn't be possible.

"S may only have H operations between them and the ends of the CIGAR string."

The SAM file can be found at the gist here:

https://gist.github.com/34dc44ea5f42ae649bb0/

java -Xmx16g -Djava.io.tmpdir='/tmp' -jar /usr/local/picard/ValidateSamFile.jar INPUT=sortsamerror.sam [Tue Aug 21 09:59:53 EDT 2012] net.sf.picard.sam.ValidateSamFile INPUT=sortsamerror.sam MODE=VERBOSE MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSI TY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false [Tue Aug 21 09:59:53 EDT 2012] Executing as user@localhost on Linux 2.6.18-274.17.1.el5 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_05-b06; Picard version: 1.73(1234) ERROR: Read groups is empty WARNING: Record 1, Read name seq.118032553, NM tag (nucleotide differences) is missing WARNING: Record 2, Read name seq.118032553, NM tag (nucleotide differences) is missing WARNING: Record 3, Read name seq.118032554, NM tag (nucleotide differences) is missing [Tue Aug 21 09:59:53 EDT 2012] net.sf.picard.sam.ValidateSamFile done. Elapsed time: 0.00 minutes. Runtime.totalMemory()=378470400
FAQ: http://sourceforge.net/apps/mediawiki/picard/index.php?title=Main_Page Exception in thread "main" java.lang.IllegalArgumentException: Malformed CIGAR string: 1SM102S at net.sf.samtools.TextCigarCodec.decode(TextCigarCodec.java:82) at net.sf.samtools.SAMRecord.getCigar(SAMRecord.java:602) at net.sf.samtools.SAMRecord.getCigarLength(SAMRecord.java:616) at net.sf.samtools.SAMRecord.isValid(SAMRecord.java:1572) at net.sf.samtools.SAMTextReader$RecordIterator.parseLine(SAMTextReader.java:431) at net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:278) at net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:250) at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:641) at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:619) at net.sf.picard.sam.SamFileValidator.validateSamRecords(SamFileValidator.java:234) at net.sf.picard.sam.SamFileValidator.validateSamFile(SamFileValidator.java:176) at net.sf.picard.sam.SamFileValidator.validateSamFileVerbose(SamFileValidator.java:135) at net.sf.picard.sam.ValidateSamFile.doWork(ValidateSamFile.java:156) at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177) at net.sf.picard.sam.ValidateSamFile.main(ValidateSamFile.java:92)

delagoya commented 12 years ago

RUM 1.x has known issues with SAM export format that does not comply with samtools or PICARD.

RUM 2.x SAM files have been validated with PICARD and should conform to the specification much more closely.

You can either re-align the data with RUM 2.x or use RUM 2.x to convert the RUM_unique and RUM_NU files to a SAM file with the included rum2sam.pl Perl script.

-angel

On Tuesday, August 21, 2012 at 1:16 PM, Kenneth Daily wrote:

Using a SAM file produced from RUM 1.11, I'm getting a strange error; the same error happens with SortSam and ValidateSamFile. According to the SAM specifications, this CIGAR string shouldn't be possible. "S may only have H operations between them and the ends of the CIGAR string." The SAM file can be found at the gist here:
https://gist.github.com/34dc44ea5f42ae649bb0/

java -Xmx16g -Djava.io.tmpdir='/tmp' -jar /usr/local/picard/ValidateSamFile.jar INPUT=sortsamerror.sam [Tue Aug 21 09:59:53 EDT 2012] net.sf.picard.sam.ValidateSamFile INPUT=sortsamerror.sam MODE=VERBOSE MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSI TY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false [Tue Aug 21 09:59:53 EDT 2012] Executing as user@localhost on Linux 2.6.18-274.17.1.el5 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_05-b06; Picard version: 1.73(1234) ERROR: Read groups is empty WARNING: Record 1, Read name seq.118032553, NM tag (nucleotide differences) is missing WARNING: Record 2, Read name seq.118032553, NM tag (nucleotide differences) is missing WARNING: Record 3, Read name seq.118032554, NM tag (nucleotide differences) is missing [Tue Aug 21 09:59:53 EDT 2012] net.sf.picard.sam.ValidateSamFile done. Elapsed time: 0.00 minutes. Runtime.totalMemory()=378470400

FAQ: http://sourceforge.net/apps/mediawiki/picard/index.php?title=Main_Page Exception in thread "main" java.lang.IllegalArgumentException: Malformed CIGAR string: 1SM102S at net.sf.samtools.TextCigarCodec.decode(TextCigarCodec.java:82) at net.sf.samtools.SAMRecord.getCigar(SAMRecord.java:602) at net.sf.samtools.SAMRecord.getCigarLength(SAMRecord.java:616) at net.sf.samtools.SAMRecord.isValid(SAMRecord.java:1572) at net.sf.samtools.SAMTextReader$RecordIterator.parseLine(SAMTextReader.java:431) at net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:278) at net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:250) at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:641) at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:619) at net.sf.picard.sam.SamFileValidator.validateSamRecords(SamFileValidator.java:234) at net.sf.picard.sam.SamFileValidator.validateSamFile(SamFileValidator.java:176) at net.sf.picard.sam.SamFileValidator.validateSamFileVerbose(SamFileValidator.java:135) at net.sf.picard.sam.ValidateSamFile.doWork(ValidateSamFile.java:156) at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177) at net.sf.picard.sam.ValidateSamFile.main(ValidateSamFile.java:92)

— Reply to this email directly or view it on GitHub (https://github.com/PGFI/rum/issues/106).

greggrant commented 12 years ago

I think he might have found a rare case, I'm going to run RUM2 against that read and make sure it's working, give me a day.

On Tue, 21 Aug 2012, Angel Pizarro wrote:

RUM 1.x has known issues with SAM export format that does not comply with samtools or PICARD.

RUM 2.x SAM files have been validated with PICARD and should conform to the specification much more closely.

You can either re-align the data with RUM 2.x or use RUM 2.x to convert the RUM_unique and RUM_NU files to a SAM file with the included rum2sam.pl Perl script.

-angel

On Tuesday, August 21, 2012 at 1:16 PM, Kenneth Daily wrote:

Using a SAM file produced from RUM 1.11, I'm getting a strange error; the same error happens with SortSam and ValidateSamFile. According to the SAM specifications, this CIGAR string shouldn't be possible. "S may only have H operations between them and the ends of the CIGAR string." The SAM file can be found at the gist here: https://gist.github.com/34dc44ea5f42ae649bb0/

java -Xmx16g -Djava.io.tmpdir='/tmp' -jar /usr/local/picard/ValidateSamFile.jar INPUT=sortsamerror.sam [Tue Aug 21 09:59:53 EDT 2012] net.sf.picard.sam.ValidateSamFile INPUT=sortsamerror.sam MODE=VERBOSE MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSI TY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false [Tue Aug 21 09:59:53 EDT 2012] Executing as user@localhost on Linux 2.6.18-274.17.1.el5 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_05-b06; Picard version: 1.73(1234) ERROR: Read groups is empty WARNING: Record 1, Read name seq.118032553, NM tag (nucleotide differences) is missing WARNING: Record 2, Read name seq.118032553, NM tag (nucleotide differences) is missing WARNING: Record 3, Read name seq.118032554, NM tag (nucleotide differences) is missing [Tue Aug 21 09:59:53 EDT 2012] net.sf.picard.sam.ValidateSamFile done. Elapsed time: 0.00 minutes. Runtime.totalMemory()=378470400

FAQ: http://sourceforge.net/apps/mediawiki/picard/index.php?title=Main_Page Exception in thread "main" java.lang.IllegalArgumentException: Malformed CIGAR string: 1SM102S at net.sf.samtools.TextCigarCodec.decode(TextCigarCodec.java:82) at net.sf.samtools.SAMRecord.getCigar(SAMRecord.java:602) at net.sf.samtools.SAMRecord.getCigarLength(SAMRecord.java:616) at net.sf.samtools.SAMRecord.isValid(SAMRecord.java:1572) at net.sf.samtools.SAMTextReader$RecordIterator.parseLine(SAMTextReader.java:431) at net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:278) at net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:250) at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:641) at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:619) at net.sf.picard.sam.SamFileValidator.validateSamRecords(SamFileValidator.java:234) at net.sf.picard.sam.SamFileValidator.validateSamFile(SamFileValidator.java:176) at net.sf.picard.sam.SamFileValidator.validateSamFileVerbose(SamFileValidator.java:135) at net.sf.picard.sam.ValidateSamFile.doWork(ValidateSamFile.java:156) at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177) at net.sf.picard.sam.ValidateSamFile.main(ValidateSamFile.java:92)

— Reply to this email directly or view it on GitHub (https://github.com/PGFI/rum/issues/106).


Reply to this email directly or view it on GitHub: https://github.com/PGFI/rum/issues/106#issuecomment-7911385

kdaily commented 12 years ago

Thanks, @delagoya and @greggrant. I've already invested about 500 hours of walltime to process 10 samples with RUM. Is there anything that RUM 2.x would give me that just converting the files wouldn't?

greggrant commented 12 years ago

Hi Kenneth, do you have the same bug with all 10 runs? Or did it just happen with one? You shouldn't have to realign, let me get to the bottom of the bug and get back to you asap. Thanks, Greg

On Tue, 21 Aug 2012, Kenneth Daily wrote:

Thanks, @delagoya and @greggrant. I've already invested about 500 hours of walltime to process 10 samples with RUM. Is there anything that RUM 2.x would give me that just converting the files wouldn't?


Reply to this email directly or view it on GitHub: https://github.com/PGFI/rum/issues/106#issuecomment-7912355

kdaily commented 12 years ago

Just this one read out of ~150 million reads per sample is the only troublesome one; this just happened to be the last sample that I tried to run through Picard (I assume there are no other reads in this sample that give the same error, I haven't tested by removing this one). I'm sure I could get by without this read, but figured I should report it anyways. Thanks!

greggrant commented 12 years ago

Thanks, sounds like a rare case, will fix it shortly and let you know. You shouldn't have to run RUM again, in fact you can probably just edit that one line in the sam file and rerun through picard. Thanks, Greg

On Tue, 21 Aug 2012, Kenneth Daily wrote:

Just this one read out of ~150 million reads per sample is the only troublesome one; this just happened to be the last sample that I tried to run through Picard (I assume there are no other reads in this sample that give the same error, I haven't tested by removing this one). I'm sure I could get by without this read, but figured I should report it anyways. Thanks!


Reply to this email directly or view it on GitHub: https://github.com/PGFI/rum/issues/106#issuecomment-7913078

greggrant commented 12 years ago

Hi Kenneth, can you please send me the entries for seq.118032554a and seq.118032554b in the reads.fa file. Thank you, Greg

On Tue, 21 Aug 2012, Kenneth Daily wrote:

Thanks, @delagoya and @greggrant. I've already invested about 500 hours of walltime to process 10 samples with RUM. Is there anything that RUM 2.x would give me that just converting the files wouldn't?


Reply to this email directly or view it on GitHub: https://github.com/PGFI/rum/issues/106#issuecomment-7912355

kdaily commented 12 years ago

Unfortunately, to save space my pipeline removes those files once the RUM run is completed. Restarting the run should recreate those files, correct (in a new directory, of course)?

delagoya commented 12 years ago

Hi Greg, can we reconstruct it from the SAM file here?

https://gist.github.com/34dc44ea5f42ae649bb0/

-a

Angel Pizarro Director, ITMAT Bioinformatics Facility

Calendar: http://bit.ly/pizarro-cal

On Wednesday, August 22, 2012 at 9:01 AM, Kenneth Daily wrote:

Unfortunately, to save space my pipeline removes those files once the RUM run is completed. Restarting the run should recreate those files, correct (in a new directory, of course)?

— Reply to this email directly or view it on GitHub (https://github.com/PGFI/rum/issues/106#issuecomment-7933173).

greggrant commented 12 years ago

In that case, can you send me the 118032554-th entry from your fastq files?

The following should get it:

head -472130216 | tail -4

Need both the forward and reverse.

Thank you!

On Wed, 22 Aug 2012, Kenneth Daily wrote:

Unfortunately, to save space my pipeline removes those files once the RUM run is completed. Restarting the run should recreate those files, correct (in a new directory, of course)?


Reply to this email directly or view it on GitHub: https://github.com/PGFI/rum/issues/106#issuecomment-7933173

greggrant commented 12 years ago

Yes, just not sure of the orientation, because it's a weird read pair where both forward and reverse reads are basically identical, save a few mismatchds near one end.

On Wed, 22 Aug 2012, Angel Pizarro wrote:

Hi Greg, can we reconstruct it from the SAM file here?

https://gist.github.com/34dc44ea5f42ae649bb0/

-a

Angel Pizarro Director, ITMAT Bioinformatics Facility

Calendar: http://bit.ly/pizarro-cal

On Wednesday, August 22, 2012 at 9:01 AM, Kenneth Daily wrote:

Unfortunately, to save space my pipeline removes those files once the RUM run is completed. Restarting the run should recreate those files, correct (in a new directory, of course)?

— Reply to this email directly or view it on GitHub (https://github.com/PGFI/rum/issues/106#issuecomment-7933173).


Reply to this email directly or view it on GitHub: https://github.com/PGFI/rum/issues/106#issuecomment-7933226

kdaily commented 12 years ago

From reads.fa:

>seq.118032554a
GATGGATAGTATATAGATGGATAGATGAGTGGATGGATAGTTAATGGATAGTATATAGATGGATAGATGAGGAGGTGGATGGATAGTGTATAGATGGATGGAC
>seq.118032554b
ATCTATCCATCTATATACTATCCATCCACCTACTCATCTATCCATCTATATACTATCCATTAACTATCCATCCACTCATCTATCCATCTATATACTATCCATC

From fastq:

@HWI-ST165:283:D09WYACXX:3:2104:9088:81104 1:N:0:
GATGGATAGTATATAGATGGATAGATGAGTGGATGGATAGTTAATGGATAGTATATAGATGGATAGATGAGGAGGTGGATGGATAGTGTATAGATGGATGGAC
+
@@@FFAEEFHDHDIGIGHIJFIGIHIICGAEHEHHJ@FEHGHCCIGGDHBG?DG@GHGGIGGBGGGGHC>D)8@F5@CCHHF=>>D);;?CCD>C3@>CCBA3

@HWI-ST165:283:D09WYACXX:3:2104:9088:81104 2:N:0:
ATCTATCCATCTATATACTATCCATCCACCTACTCATCTATCCATCTATATACTATCCATTAACTATCCATCCACTCATCTATCCATCTATATACTATCCATC
+
@CCFFFFFGGGGHJIIIIJJJJJJIIIJDHIEHCGGGIIIJIIEHIJIIFGGGGHI>@GGIEGIJIJIBGCGIGGIIJJE>GIIHFHHHHFCDCBCCFAECEC

kdaily commented 12 years ago

On a mildly related note, when I try to run RUM 1.11 with just these two paired reads, it never seems to finish; is there some issue with only having a single mapped read pair?

mdelaurentis commented 12 years ago

Yes, that is an issue with RUM 1.11 that has been fixed on the 2.x branch. I believe the script that parses the output of BLAT will loop forever if it gets an empty input file from BLAT. This bug only really shows up if you run a very small number of mappings, so that they all get mapped by Bowtie. Thanks for the report.