t-neumann / slamdunk

Streamlining SLAM-seq analysis with ultra-high sensitivity
GNU Affero General Public License v3.0
37 stars 22 forks source link

Invalid quality values in bam file when run with fasta. #30

Open fabou-uobaf opened 6 years ago

fabou-uobaf commented 6 years ago

I mapped a fasta (instead of fastq), so holding no quality score in there. Sam specs demand the QUAL field to have the same length as the SEQ field, or to be a ''. In slamdunk dunk it seams that a bam file is produced where the QUAL score if not available consists of an '' and additional seq_length-1 non printable characters.

please try samtools view slamdunk_mapped.bam | perl -F"\t" -lane 'BEGIN{print join("\t", qw/totalLength >QUALstring< lengthNonprintableChars lengthPrintableChars/)}$cnt_n_print = 0; $cnt_print = 0;while ($F[10] =~ m/[^[:print:]]/g) {$cnt_n_print++};while ($F[10] =~ m/[[:print:]]/g) {$cnt_print++};print join("\t", length($F[10]), ">$F[10]<", $cnt_n_print, $cnt_print);' | head

t-neumann commented 6 years ago

I'm actually surprised slamdunk does not crash immediately. Why would you want to map a fasta file and then what exactly are you expecting the qual-score field in the resulting bam file to be?

fabou-uobaf commented 6 years ago

Sorry, I lost a few characters in my previous post, in particular Asteriks *.

Bottom line what I wanted to make you aware of is that the bam file produced by slamdunk does not adhere to the sam file specificiation if there is no sequence quality information available. This prohibits e.g. to run samtools view -bS on the before decompressed sam-file. I though this might be of interest to you.

From the sam file specifications: "QUAL: ... This field can be a when quality is not stored. If not a , SEQ must not be a * and the length of the quality string ought to equal the length of SEQ."

t-neumann commented 6 years ago

Ok I didn't know that - will forward it to the NextGenMap developer, that's not really something Slamdunk itself introduces. Thanks.

philres commented 6 years ago

Hi and thank you for reporting this.

Unfortunately, I could not reproduce it. I ran slamdunk map and slamdunk filter on a small FASTA file and got '*' as quality string for all reads. Your script gave the expected result as well:

totalLength >QUALstring<    lengthNonprintableChars lengthPrintableChars
1   >*< 0   1
1   >*< 0   1
1   >*< 0   1
1   >*< 0   1
1   >*< 0   1
1   >*< 0   1
1   >*< 0   1
1   >*< 0   1
1   >*< 0   1

Could you send me a part of the FASTA file that caused the problem? And could you please tell me the version of slamdunk and pysam you are using and how you installed SlamDunk?

pip freeze | grep pysam

Thanks, Philipp