samtools / htslib

C library for high-throughput sequencing data formats
Other
813 stars 446 forks source link

Feature request: Add option to tabix to interleave multiple bgzipped indexed files, without the need for external resorting. #1338

Open ghuls opened 3 years ago

ghuls commented 3 years ago

Add option to tabix to interleave multiple bgzipped indexed files, without the need for external resorting.

# bgzipped file*.bed.gz files with index.

# "Concatenate" indexed bgzipped files so they result in a sorted bgzip output file.
tabix --cat -o output.bed.gz file1.bed.gz file2.bed.gz ...
jkbonfield commented 3 years ago

Just to be clear, are you calling this "cat" as file1.bed.gz, file2.bed.gz, file3.bed.gz etc are internally sorted with the first item in file2 coming after the last in file1, and so on? (Eg chr1.bed, chr2.bed, etc)? I could understand that being called "cat". (Edit: but that doesn't feel like interleaving to me.)

Or are you saying that file1.bed, file2.bed,... are sorted and may overlap, and the function just takes the next record from each, whichever should be sorted first? This is the impression I get when you say "interleave". If so that's "merge" and not "cat". Both choices have valid use cases IMO.

ghuls commented 3 years ago

Merge is indeed the word I was looking for. file1.bed and file2.bed are sorted, combine them so the output is still sorted (by only sorting the needed blocks and not the whole files).

So this can be avoided (mainly the full sort step):

zcat file1.bed.gz file2.bed.gz | sort -k 1,1 -k 2,2n -k 3,3n | bgzip ... 

So basically samtools merge but for non BAM/CRAM files.

jmarshall commented 3 years ago

I initially had the impression you meant a way to make a single index file that would enable queries over the set of input files (in place), doing some kind of gathered interleaving of records read directly from the individual input files. While just about plausible, this would be rather impractical to implement!

A form of samtools merge for the sort of generic genome-position-sorted files that tabix indexes is a more plausible beast. While it could be implemented as a separate tool, it would also fit into/alongside tabix without much difficulty.

Because tbx_itr_querys(tbx. ".") (assuming "." works for tbx indexes!) returns curr_tid/curr_beg/curr_end, it would be fairly easy to do the comparison between records to choose which record to output into the merged output first, at least for a.curr_beg <=> b.curr_beg.

The major wrinkle is in comparing tid values. Because at least in theory, file1.bed.gz, file2.bed.gz, … fileN.bed.gz may each contain different subsets of the complete set of chromosomes, the integer tid values are not necessarily meaningfully comparable between different input files. So you would need to do some kind of clever building of and lookup into a merged chromname→tid dictionary, or (easier) also require an option providing a file (e.g. an fai file or bedtools genome file) giving the desired ordering of chromosomes.

jkbonfield commented 3 years ago

It's an interesting one. For SAM and VCF of course there are already dedicated merge tools. We could do another for BED, but tabix is totally generic. It has configurations stating which fields to index on.

However it strikes me there is one thing missing here in that configuration - collation order. I think tabix can index something working on the assumption that the data is sorted and therefore it's just recording the change from one chromosome to another, but I don't think it would care if that order was chr 1,2,3,...9,10,11 or chr 1,10,11,12...,19,2,20,21,... (ie ascii sort order rather than alphanumeric). That works because the entire index is loaded into memory and a genome chromosome query just fetches that part of index. Maybe I'm wrong here, but I'm pretty sure our indices have no notion of collation.

However, you do need to know the collation order when you're merging. For SAM and VCF that's done by the order of the SQ or contig lines in the headers, but no such thing exists in BED (for example), or any other arbitrary tab delimited file. Therefore this would require some substantial tabix additions I think, either a general purpose header parser capability of extracting file specific collation order, or a general purpose mechanism to describe collation orders.

Edit: I guess if the files you're merging have themselves already been indexed, then you have a file specific collation order already defined in each index. I guess that's what John is referring to by the tid comments. As he explains, you'd need to map back to some unified numbering so they can be sorted properly.

ghuls commented 3 years ago

@jkbonfield would that still be true when usage of this feature would require index files to be there (and get the ordering from there)? In the worst case, the user provides an external ordered list (similar to the chromosome file needed for bedtools).

jkbonfield commented 3 years ago

If we require it to be indexed first, then it becomes an easier proposal and no additional collation order information is needed. (See my dawning realisation in the "Edit:" comment above). Thinking further, I'm assuming now you wanted this to work on regions too (hence indexed files). As you didn't mention this I was thinking of whole files without, but as you comment you can just do that with unix sort anyway so why would we even need a new tool; other than time complexity as merging is faster than sort, but not necessarily hugely so.

A new tool however really helps if you want to do this on parts of the data, which implies indices. That's a (simpler) on-the-fly alternative to the multi-file indexing problem that John mentioned above. I can see the benefit of having such a tool.

It's not a totally trivial bit of work though, so I don't see this feature arriving in the immediate future at least.

ghuls commented 3 years ago

As you didn't mention this I was thinking of whole files without, but as you comment you can just do that with unix sort anyway so why would we even need a new tool; other than time complexity as merging is faster than sort, but not necessarily hugely so.

In my case the bgzipped BED files are several GB each. I assumed a merging approach should be much faster than a naive full sort of the data.

jkbonfield commented 3 years ago

Maybe :-) It feels more like an improved version of Unix comm.

Anyway in the short term you should probably use bedtools merge:

https://bedtools.readthedocs.io/en/latest/content/tools/merge.html

ghuls commented 3 years ago

bedtools merge does something completely different. First you need one file as input, secondly it will merge overlapping intervals to a new bed entry, not "sorting" input intervals and preserving them.

ghuls commented 11 months ago

If we require it to be indexed first, then it becomes an easier proposal and no additional collation order information is needed. (See my dawning realisation in the "Edit:" comment above). Thinking further, I'm assuming now you wanted this to work on regions too (hence indexed files). As you didn't mention this I was thinking of whole files without, but as you comment you can just do that with unix sort anyway so why would we even need a new tool; other than time complexity as merging is faster than sort, but not necessarily hugely so.

I was looking for using it on whole files.

An example for which it would be useful, would be to merge fragments.tsv.gz (BED with count in column 5) files from scATAC data from different sequencing runs, which are already sorted in the same chromosome order as those files are created directly from mapped BAM files.

It's not a totally trivial bit of work though, so I don't see this feature arriving in the immediate future at least.

Any chance it would be considered now?

jkbonfield commented 11 months ago

Unfortunately we've got 2 new projects on the go in addition to samtools (and the team hasn't been given any more resources) so our resources are stretched. So we can't commit to any specific time line, if ever sadly.

It is a reasonable idea though.

ghuls commented 11 months ago

It is a pity your are not getting more resources. Hopefully the situation improves in the future.