simon-anders / htseq

HTSeq is a Python library to facilitate processing and analysis of data from high-throughput sequencing (HTS) experiments.
https://htseq.readthedocs.io/en/release_0.11.1/
GNU General Public License v3.0
122 stars, 77 forks

* in column 7 of sam is not recognized #51

Closed jtmieg closed 6 years ago

jtmieg commented 6 years ago

In SAM files, one often encounters a * in column 7, and this conforms to the SAM specification. HTSeq crashes on the *. If you replace the * by a zero (0) in column 7, then HTSeq is happy. I suggest that HTSeq should accept the * in column 7, but I have not studied all the consequences. What do you think? Thank you.
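For context, column 7 of a SAM alignment line is the RNEXT field (reference name of the mate or next read). The SAM specification allows `*` there when the information is unavailable and `=` when it equals RNAME. The following is only a sketch for checking which values actually occur in that column of a given file; the function name and the two example records are hypothetical and are not taken from the reporter's data.

```python
from collections import Counter


def count_rnext_values(lines):
    """Tally the values found in column 7 (RNEXT) of SAM alignment lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("@"):
            # Header lines start with '@' and carry no alignment fields.
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            # A SAM alignment line has 11 mandatory fields; skip anything shorter.
            continue
        counts[fields[6]] += 1
    return counts


# Hypothetical records, made up for illustration only.
example = [
    "@HD\tVN:1.0\tSO:unsorted\n",
    "read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n",
    "read2\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII\n",
]
print(count_rnext_values(example))
```

Running this over the output of each aligner separately would show which one emits the `*`.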

fbucchini commented 6 years ago

Hi,

Have you managed to find another workaround? I am also experiencing this issue.

iosonofabio commented 6 years ago

OK guys, let's try to get this done. Does either of you have a minimal example?

Is this referring to htseq-count or to HTSeq as a library? If the latter, what function fails and what's the error message?

Thanks

jtmieg commented 6 years ago

Hello Fabio

I used a workaround in my Python code. As you can see below, I pipe the output of samtools through a gawk command which replaces the * in column 7 by a zero. Then I can use

    import HTSeq
    for a in HTSeq.SAM_Reader( input_stream ):
        # analyze the alignment in Python

This is clearly suboptimal, but it works. I am not sure why I had a * in column 7 in the first place, but this came out of using several aligners (STAR, TopHat2, HISAT2); at least one of them gave me a *.

It would be great if you could just add this replacement, in silent mode, inside HTSeq.SAM_Reader( input_stream ).

Thanks for maintaining the code.
Jean

    import os

    # Select the input stream
    # Raw string, so that \t and \n reach gawk as escape sequences
    gg = r""" gawk -F '\t' '{gsub("*","0",$7);printf("%s",$1);for(i=2;i<=NF;i++)printf("\t%s",$i);printf("\n");}' """

    if input_file == "":
        # No file given: read from sys.stdin through sort
        input_stream = os.popen( " sort -k 1,1 ")
    else:
        if file_type == "BAM":
            input_stream = os.popen( "samtools view -h " + input_file + " | " + gg + " | sort -T . -k 1,1 ")
        elif file_type == "SAMSORTED":
            # Already sorted SAM: opened directly, without the gawk filter
            input_stream = open(input_file, "r")
        elif file_type == "SAMGZ":
            input_stream = os.popen( "gunzip -c " + input_file + " | " + gg + " | sort -T . -k 1,1 ")
        else:
            input_stream = os.popen( "cat " + input_file + " | " + gg + " | sort -T . -k 1,1 ")
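The same replacement could also be done in Python itself rather than through gawk. The sketch below is an assumption about how that might look, not something HTSeq provides: `replace_star_in_column7` is a hypothetical helper, and unlike the gsub call above it only rewrites column 7 when the whole field is exactly `*`. It does not sort the records, so the sort step of the pipeline would still be needed where the reads must be ordered by name.

```python
def replace_star_in_column7(lines, replacement="0"):
    """Yield SAM lines with a '*' in column 7 swapped for `replacement`.

    Header lines (starting with '@') and all other columns pass through unchanged.
    """
    for line in lines:
        if line.startswith("@"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 7 is index 6; only touch it when it is exactly '*'.
        if len(fields) >= 7 and fields[6] == "*":
            fields[6] = replacement
        yield "\t".join(fields) + "\n"


# Hypothetical record, made up for illustration only.
sample = [
    "@HD\tVN:1.0\n",
    "read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n",
]
for cleaned in replace_star_in_column7(sample):
    print(cleaned, end="")
```

Presumably the resulting generator could be passed to HTSeq.SAM_Reader in place of input_stream, provided the reader accepts any iterable of lines; that has not been verified here.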


iosonofabio commented 6 years ago
  1. Python 2 or 3?
  2. Are these reads Illumina paired-end?


fbucchini commented 6 years ago

Hi Fabio,

For me this problem occurred when I was using htseq-count. However, I think that it was simply a conflict between two HTSeq versions on my system, since I don't experience this problem anymore after a clean re-install (with version 0.9.1).
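A version conflict of this kind can sometimes be spotted by looking for more than one copy of the package on Python's module search path. The following is a rough sketch under that assumption; `find_package_copies` is a hypothetical helper, and it only detects copies installed as plain directories on `sys.path`.

```python
import os
import sys


def find_package_copies(package_name, search_paths=None):
    """Return every directory on the search path that contains the package."""
    if search_paths is None:
        search_paths = sys.path
    found = []
    for entry in search_paths:
        # An empty sys.path entry stands for the current directory.
        candidate = os.path.join(entry or ".", package_name)
        if os.path.isdir(candidate) and candidate not in found:
            found.append(candidate)
    return found


# More than one path in this list suggests several installed copies.
print(find_package_copies("HTSeq"))
```

When several copies are listed, the one that appears first is typically the one an import picks up, which may differ from the copy that the htseq-count script was installed with.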

iosonofabio commented 6 years ago

OK guys, it seems like you solved this on your own. Closing until further notice.