Closed JMencius closed 2 months ago
Hi @JMencius,
Thanks for the PR this looks really useful. However, I think this code will break when a bam file is streamed to dysgu e.g.:
samtools view -bh test.bam chr1:1-50000 | dysgu call --buffer-size 100000 hg19.fa wd -
This is because a stream file cannot be reset. If you move part of your code to coverage.pyx
, you can use the stream in get_read_properties
function https://github.com/kcleal/dysgu/blob/master/dysgu/coverage.pyx#L329 - this will keep the stream open.
Also the error will still not be caught in the dysgu run pipeline - to catch it you would have to add some c++ code to find_reads.hpp https://github.com/kcleal/dysgu/blob/master/dysgu/find_reads.hpp#L285 I think just checking the header file for the SO tag would be enough for this case. Also im happy to accept the PR if only the python code is implemented.
Much appreciated!
Hi @kcleal
Yeah, that is really tricky to deal with a stream file. Maybe I will just kept the code for checking SO
and move it to coverage.pyx
. Am I get it right?
Hi @kcleal
If I get you points right, I move almost all the code I wrote from cluster.pyx
to coverage.pyx
, without reseting the stream.
I still use a redundant way to check if the bam file is aligned, because I check many unsorted bam files from guppy
or dorado
produce by minimap2
. Many of them do not have the SO
tag. I think it is always safer to check the index rather than just the SO
tag.
I also run the dysgu test
and tested on an unsorted bam file, everything looks good.
I did not modified the C++ code becuase I am bad in C++ sadly.
Hope you like my latest commit.
Looks good thanks! I will get it merged
Update cluster.pyx
I use two redundant ways to check if bam file is sorted as mentioned in issue #84.
One is just directly check the SO tag in HD, but some bam file do not have this tag and is still can be unsorted, so I also employ an alternative way to just check the first 1K reads to see if it is sorted.
If the input bam file in unsorted, an Error is raise. If it is sorted, a log is create to log the pass check.
I also run
dysgu test
and all of them pass.Additionally, I try to run on some unsorted bam file, it works like a charm.
Hope you like my modifications.