itmat / rum

RNA-Seq Unified Mapper
http://cbil.upenn.edu/RUM
MIT License
26 stars 4 forks source link

fixing .sam format #33

Closed khayer closed 12 years ago

khayer commented 12 years ago

Talking to Mike about RUM, I remembered an older issue with the its output files. I do not know if the problem still exists in the newest version, but I still want to mention it. For some reason samtools does not like RUM.sam files, which makes it impossible to prepare them for further analysis. For example the GATK analysis tool only takes sorted and indexed .bam files. Being able to convert the .sam into a valid .bam file would also help with the disk space.

khayer commented 12 years ago

I just tested samtools on RUM.sam, that was created overnight. Unfortunately samtools does not even let me convert the file into .bam anymore. The error message is the following:

root@master:/data/SIM3_TEST1/test_rum# samtools view -S -b RUM.sam > RUM.bam
[samopen] SAM header is present: 34 sequences.
Parse error at line 35: sequence and quality are inconsistent
Aborted

I used samtools-0.1.18 (latest version).

EDIT: Line 35 is the first read in the file:

seq.1   99  chr15   54681682    255 1M9915N52M2403N47M  =   54694163    13635   TCTTCTGTAGATCCTTTGGTATGAAGGCGTTTATTTAGTTCTTCCAATTTGTTCTTTGGCTCTACCTTATCATCACAGGTGCAGCCCAGGTCAAAATCAG        XO:A:F  IH:i:   HI:i:1

As you can see the quality is completely missing!

mdelaurentis commented 12 years ago

Katharina also says that if the input is a fasta file, there should be a "*" in the qualities column, whereas it is currently empty.

khayer commented 12 years ago

Deleting the header lines (@SQ lines) does not solve the problem:

root@master:/data/SIM3_TEST1/test_rum# samtools view -S -b RUM_short.sam 
[samopen] no @SQ lines in the header.
[sam_read1] missing header? Abort!
?BC!sr?ed?|?t?
khayer commented 12 years ago

With the header lines included you can use samtools to convert and sort your sam file. But I just ran Picard tools on the output and I still seems funny. Picard tools is provided by samtools and is able to detect errors in the sam/bam files. I used the following command:

root@master:/data/SIM3_TEST1/test_rum# java -jar ../../Tools/picard-tools-1.67/ValidateSamFile.jar I=RUM_sorted.bam 

Summarized the errors are:

 [Wed May 02 15:11:35 UTC 2012] net.sf.picard.sam.ValidateSamFile INPUT=RUM_sorted.bam       MODE=VERBOSE MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true   IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false   VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000   CREATE_INDEX=false CREATE_MD5_FILE=false
[Wed May 02 15:11:35 UTC 2012] Executing as root@master on Linux 3.0.0-14-virtual amd64; OpenJDK 64-Bit     Server VM 1.6.0_23-b23; Picard version: 1.67(1190)
ERROR: Read groups is empty
WARNING: Record 1, Read name seq.259, NM tag (nucleotide differences) is missing
...
ERROR: Record 44, Read name seq.998, MAPQ should be 0 for unmapped read.
ERROR: Record 45, Read name seq.998, Mate negative strand flag does not match read negative strand flag of mate
...
Maximum output of [100] errors reached.

Changing the MAPQ to 0 for unmapped reads and including an NM tag should not be a problem.

Seq.998 looks like this:

seq.998 117 chr1    24622579    255 *   =   24622579    0   GAGGAAGACATATTATATGAGATTGGCTTGAAACCAATTTTAGGGGGTTCGATTCCTTCCTTTCTTATTTTACTTTTACATAGGTTGGTTCCTCGAATGT        XO:A:F  IH:i:2  HI:i:2
seq.998 153 chr1    24622579    255 100M    =   24622579    0   GTGTGATATGGTGGAGGGCAGCCATGAAGTCATTCTAAATTTGTTGAAGCATACGATACTGATATTACTTCTCGTTTTGAAGCAAAGGCCTCTCAAATTA        XO:A:F  IH:i:2  HI:i:2

This error does not seem like an easy fix, but I keep investigating.

khayer commented 12 years ago

The errors disappeared by

As for right now RUM.sam has invalid combinations in some of the bit flag. All the restrictions can be found in http://samtools.sourceforge.net/SAM1.pdf.

khayer commented 12 years ago

I just tested the GATK with the modified output. It does not accept just the * , if the qualities are missing. The quality has to have same length as the sequence itself. When I replaced the * with "IIIIIII.....IIII", GATK ran without a problem.

mdelaurentis commented 12 years ago

I believe all of these issues are fixed as of now on the 2.x branch. They'll go into 2.00_10.