hasindu2008 / f5c

Ultra-fast methylation calling and event alignment tool for nanopore sequencing data (supports CUDA acceleration)
https://hasindu2008.github.io/f5c/docs/overview
MIT License
140 stars 27 forks source link

Getting the contigs of interest from the reference sequence instead of the BAM file #84

Closed andreaswallberg closed 3 years ago

andreaswallberg commented 3 years ago

Dear @hasindu2008

This is a feature request.

Having set up my fast5-indexes and BAM file, I have realized that I sometimes want to run the analyses across specific contigs (up to 20-30 thousand ones in my case), rather than the whole genome.

I am aware that the program be specified to scan only a particular sequence, but there is an substantial overhead associated with running just across a short contig and then restart the program and run it again.

I guess one way to accomplish what I want is to filter the BAM file so that it only contains mappings against the contigs of interest.

Another, much simpler method is to filter the reference genome so that it only contains the contigs of interest. In my case, also in a specific order of interest, from the most prioritized sequences to the least. This is easy to accomplish by just updating the reference genome file.

However, it seems like the program defaults to scanning the data according to the order of contigs in the BAM file, not in the reference genome file.

Would it be possible to tweak the code so that the program gets the target contigs for analyses, and their order of analysis, from the reference file instead of the BAM file?

This would make the program very flexible and accommodate various ad-hoc subsets for analysis, if needed.

hasindu2008 commented 3 years ago

@andreaswallberg

Currently, f5c uses an hts iterator in htslib (_sam_itrqueryi) that iterates through a given chr:beg-end region. When a batch of alignment records are loaded, the corresponding sequences are located from the reference genome through faidx. Doing vice versa is possible but require some code restructuring. An easier method that I can think of is using _sam_itrregions in htslib that accepts a list of regions (have to look further into it to be sure). What would be the best way to accept the list of the required regions - a bed file to the -w option?

I will try my best to get this feature onto the next f5c release. Thank you for the suggestion.

hasindu2008 commented 3 years ago

Hi @andreaswallberg

Finally, I managed to implement a feature for this in f5c. This is not yet thoroughly tested, but the experimental version is available in the branch https://github.com/hasindu2008/f5c/tree/multi_region2. Also, some compiled binaries are attached herewith. It would be great if you could give it a try and let me know if this suits your use case.

In this version, you can provide a bed file with region windows to f5c as -w reg.bed.

f5c-v0.6-33-g063bc74-binaries.tar.gz

hasindu2008 commented 3 years ago

In the latest release of f5c, now you can provide a list of regions as a bed file (-w regions.bed). Closing this issue for now. Feel free to reopen this issue if you have any more thoughts. Suggestions are always appreciated :)