kcleal / dysgu

Toolkit for calling structural variants using short or long reads
MIT License
88 stars 10 forks source link

Check if input bam file is sorted #89

Closed JMencius closed 2 months ago

JMencius commented 2 months ago

Update cluster.pyx

I use two redundant ways to check if bam file is sorted as mentioned in issue #84.

One is just directly check the SO tag in HD, but some bam file do not have this tag and is still can be unsorted, so I also employ an alternative way to just check the first 1K reads to see if it is sorted.

If the input bam file in unsorted, an Error is raise. If it is sorted, a log is create to log the pass check.

I also run dysgu test and all of them pass.

Additionally, I try to run on some unsorted bam file, it works like a charm.

$ dysgu call -x -o test_unsorted.vcf mm39.fa ~/Downloads/dysgu-master/wd_test mouse_unsorted.bam;

2024-04-14 11:40:38,494 [INFO   ]  call -x -o test_unsorted.vcf ~/dysgu-master/wd_test mouse_unsorted.bam
Traceback (most recent call last):
  File "~dysgu/bin/dysgu", line 33, in <module>
    sys.exit(load_entry_point('dysgu==1.6.3', 'console_scripts', 'dysgu')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/miniconda3/envs/dysgu/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/miniconda3/envs/dysgu/lib/python3.12/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "~miniconda3/envs/dysgu/lib/python3.12/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/miniconda3/envs/dysgu/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/miniconda3/envs/dysgu/lib/python3.12/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/miniconda3/envs/dysgu/lib/python3.12/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/miniconda3/envs/dysgu/lib/python3.12/site-packages/dysgu-1.6.3-py3.12-linux-x86_64.egg/dysgu/main.py", line 420, in call_events
    cluster.cluster_reads(ctx.obj)
  File "dysgu/cluster.pyx", line 1181, in dysgu.cluster.cluster_reads
    raise ValueError("Input bam file must be sorted")
ValueError: Input bam file must be sorted

Hope you like my modifications.

kcleal commented 2 months ago

Hi @JMencius,

Thanks for the PR this looks really useful. However, I think this code will break when a bam file is streamed to dysgu e.g.:

samtools view -bh test.bam chr1:1-50000 | dysgu call --buffer-size 100000 hg19.fa wd - 

This is because a stream file cannot be reset. If you move part of your code to coverage.pyx, you can use the stream in get_read_properties function https://github.com/kcleal/dysgu/blob/master/dysgu/coverage.pyx#L329 - this will keep the stream open.

Also the error will still not be caught in the dysgu run pipeline - to catch it you would have to add some c++ code to find_reads.hpp https://github.com/kcleal/dysgu/blob/master/dysgu/find_reads.hpp#L285 I think just checking the header file for the SO tag would be enough for this case. Also im happy to accept the PR if only the python code is implemented.

Much appreciated!

JMencius commented 2 months ago

Hi @kcleal Yeah, that is really tricky to deal with a stream file. Maybe I will just kept the code for checking SO and move it to coverage.pyx . Am I get it right?

JMencius commented 2 months ago

Hi @kcleal If I get you points right, I move almost all the code I wrote from cluster.pyx to coverage.pyx, without reseting the stream. I still use a redundant way to check if the bam file is aligned, because I check many unsorted bam files from guppy or dorado produce by minimap2. Many of them do not have the SO tag. I think it is always safer to check the index rather than just the SO tag. I also run the dysgu test and tested on an unsorted bam file, everything looks good. I did not modified the C++ code becuase I am bad in C++ sadly. Hope you like my latest commit.

kcleal commented 2 months ago

Looks good thanks! I will get it merged