Closed: mickaelsilva closed this issue 7 years ago
Thanks for the report. We have not profiled the code, but it's probably related to UTF-8 encoding. Will have a look next week, thx!
On May 23, 2017 8:01:19 AM PDT, mickaelsilva notifications@github.com wrote:
I've been using HTSeq to read FASTA files and I noticed a huge increase in the time needed to iterate over a FASTA object between Python 2.7 (HTSeq version 0.6.1p1) and Python 3.4 (HTSeq version 0.7.2).
For instance :
    import HTSeq

    for contig in HTSeq.FastaReader("genomeX.fasta"):
        # do nothing
        pass
This task takes 0.01 s with Python 2.7 compared to 6.7 s with Python 3.4 (bacterial genome, 2.2 Mb).
Did you ever notice such behavior?
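For anyone wanting to reproduce a measurement like the one above, here is a minimal timing sketch. The helper name `time_iteration` is hypothetical and not part of HTSeq; it simply exhausts whatever iterable it is given and reports the best wall-clock time over a few repeats. The demo call uses a plain `range` so the snippet runs without HTSeq installed.

```python
import time


def time_iteration(make_iterable, repeats=3):
    """Return the best wall-clock time (seconds) needed to exhaust the iterable.

    make_iterable is a zero-argument callable that returns a fresh iterable,
    so that every repeat starts from the beginning.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for _item in make_iterable():
            pass  # do nothing, as in the report
        best = min(best, time.perf_counter() - start)
    return best


# With HTSeq installed, the loop from the report could be timed as:
#   time_iteration(lambda: HTSeq.FastaReader("genomeX.fasta"))
# Stand-in iterable so this sketch runs on its own:
elapsed = time_iteration(lambda: range(1000))
print(f"{elapsed:.6f} s")
```

Running the same script under both interpreters, with the same file, would give directly comparable numbers.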
-- You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub: https://github.com/simon-anders/htseq/issues/28
Hey @mickaelsilva, I looked into this. It would be easy to make this fast again, but we would need to read FASTA as a binary format, which is kind of itchy since FASTA is commonly used as an ASCII text format. In other words, if we optimize the decoding away, people can no longer use non-ASCII characters in their sequence names, since the whole file is read essentially as ASCII. Can you have a look at other libraries and try to figure out whether anyone else assumes that FASTA files are purely ASCII? Thanks!
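To make the trade-off concrete, here is a small sketch of what happens to a header line with a non-ASCII name under the different approaches. The header content is made-up illustrative data, and none of this is HTSeq code; it only uses Python's built-in `bytes.decode`.

```python
# A FASTA header whose name contains non-ASCII characters (illustrative data).
raw_header = ">contig_1 大肠杆菌\n".encode("utf-8")

# Text path: decoding as UTF-8 recovers the full name as a str.
name_utf8 = raw_header.decode("utf-8")[1:].strip()

# ASCII-only path: strict ASCII decoding rejects the same bytes.
try:
    raw_header.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Binary path: the name stays as undecoded bytes. Nothing is lost,
# but the caller has to decode it if a str is needed.
name_bytes = raw_header[1:].strip()

print(name_utf8)
print(ascii_ok)
print(name_bytes.decode("utf-8") == name_utf8)
```

So the choice is roughly between paying for decoding on every line, rejecting such names outright, or handing bytes to the user.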
Hello @iosonofabio, in our experience with FASTA files we can't see a reason for a sequence name to contain non-ASCII characters. Of course we can't be sure that all FASTA files will always contain only ASCII characters. However, since the speed difference is so considerable and the number of FASTA files containing non-ASCII characters will probably be negligible, treating FASTA files as ASCII-only would be the best compromise (for us personally and for the community).
My concern is: what if, say, a Chinese user puts Chinese characters in the FASTA label?
Can you please point me to any other library that parses FASTA as binary?
The speedup is real, but it applies to a corner case in which you have a LOT of reads and you do exactly nothing with them. Real life is not like that. Anyway, I'm leaning towards creating a separate, fast iterator for speed-sensitive cases and keeping this one slow but universally compatible. That's how e.g. biopython does it, and it works well.
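The two-iterator idea can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names `iter_fasta_text` and `iter_fasta_bytes` are hypothetical, and this is not the HTSeq or biopython implementation. One reader decodes every line and yields `str`, the other opens the file in binary mode and yields `bytes` without decoding.

```python
import os
import tempfile


def iter_fasta_text(path, encoding="utf-8"):
    """General reader: decodes the file and yields (name, sequence) as str."""
    name, chunks = None, []
    with open(path, "r", encoding=encoding) as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:], []
            elif name is not None:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def iter_fasta_bytes(path):
    """Speed-oriented reader: yields (name, sequence) as bytes, no decoding."""
    name, chunks = None, []
    with open(path, "rb") as handle:
        for line in handle:
            line = line.rstrip(b"\r\n")
            if line.startswith(b">"):
                if name is not None:
                    yield name, b"".join(chunks)
                name, chunks = line[1:], []
            elif name is not None:
                chunks.append(line)
    if name is not None:
        yield name, b"".join(chunks)


# Small demonstration file (made-up data).
with tempfile.NamedTemporaryFile(
    "w", suffix=".fasta", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(">seq1 first\nACGT\nACGT\n>seq2\nTTGA\n")
    demo_path = tmp.name

text_records = list(iter_fasta_text(demo_path))
byte_records = list(iter_fasta_bytes(demo_path))
os.remove(demo_path)

print(text_records)
print(byte_records)
```

Whether the bytes variant is actually faster on a given file would still need to be measured; the point of the split is that users who need non-ASCII names keep a reader that handles them.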
I understand your concerns and agree that the two-iterator proposal is the best possible solution.
Thanks.
Awesome, give me some time to implement it. Thanks!
Fixed in 6176568; will be in 0.9.0.