morispi / LRez

Standalone tool and library allowing to work with barcoded linked-reads
GNU Affero General Public License v3.0

Include barcode integer suffix in index. #8

Open pontushojer opened 1 year ago

pontushojer commented 1 year ago

Relates to #6.

As noted in the longranger docs (below), the suffix number can be any integer, not just "-1", as it is meant to allow merging different 10X libraries into the same BAM.

The BX tag includes a suffix with a dash separator followed by a number: AGAATGGTCTGCATCG-1 This number denotes what we call a GEM group, and is used to virtualize barcodes in order to achieve a higher effective barcode diversity when combining samples generated from separate GEM chip channel runs. Normally, this number will be "1" across all barcodes when analyzing a sample generated from a single GEM chip channel. It can either be left in place and treated as part of a unique barcode identifier, or explicitly parsed out to leave only the barcode sequence itself.
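To make the quoted format concrete, here is a minimal sketch of the "explicitly parsed out" option: it splits a BX value into the nucleotide sequence and the GEM group integer. The names `BX_PATTERN` and `split_bx` are hypothetical, not part of LRez or longranger, and the sketch assumes barcodes contain only A, C, G and T.

```python
import re

# Hypothetical pattern: a nucleotide barcode, optionally followed by "-<integer>".
BX_PATTERN = re.compile(r"([ACGT]+)(?:-(\d+))?")


def split_bx(bx_value):
    """Return (sequence, gem_group); gem_group is None when there is no suffix."""
    match = BX_PATTERN.fullmatch(bx_value)
    if match is None:
        raise ValueError(f"unrecognized barcode: {bx_value!r}")
    sequence, group = match.groups()
    return sequence, (int(group) if group is not None else None)
```

With this, `split_bx("AGAATGGTCTGCATCG-1")` and `split_bx("AGAATGGTCTGCATCG-6")` share the same sequence but differ in GEM group, which is the case at issue here.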

I ran into this issue when running LRez index bam on a BAM with multiple libraries, which resulted in the following error:

determineSequencingTechnology: Unrecognized sequencing technology. Please make sure your barcodes originate from a compatible technology or are reported as nucleotides in the BX:Z tag.

From what I can understand from the code, this suffix is currently not included in the index. For LRez to work with BAMs that contain multiple libraries, this would need to be fixed.

clemaitre commented 1 year ago

Hi,

Thank you for reporting this issue. We were not aware that the suffix number could be any integer, and this case was not anticipated in the LRez code: "-1" suffixes are simply removed from the barcode tag, and any other integer suffix is not recognized.

In the case of different suffix numbers in a single BAM, the expected behaviour of LRez would be to treat two barcodes that share the same nucleotide sequence but have different suffix numbers as two distinct barcodes, is that correct?

This may not be straightforward to implement in LRez, since LRez assumes all barcodes are purely nucleotide sequences and encodes them into integers with a 2-bit encoding. The suffix numbers could be converted to nucleotide words appended to the barcodes, but this would cost extra space for the vast majority of datasets, which only have the "-1" suffix. To keep that extra space minimal, we would need to know in advance the maximal number of different integer suffixes for the given sample.
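As an illustration of this idea only, and not of the actual LRez code, the sketch below packs a barcode into an integer at 2 bits per nucleotide and appends the suffix as a fixed-length nucleotide word. The mapping A=0, C=1, G=2, T=3, the function names, and the default word length of 2 are all assumptions made for the example.

```python
NUCLEOTIDES = "ACGT"  # assumed mapping: A=0, C=1, G=2, T=3


def encode_2bit(sequence):
    """Pack a nucleotide string into an integer, 2 bits per base."""
    value = 0
    for base in sequence:
        value = (value << 2) | NUCLEOTIDES.index(base)
    return value


def suffix_to_word(suffix, word_length):
    """Write the 1-based suffix as a fixed-length base-4 nucleotide word."""
    index = suffix - 1
    if not 0 <= index < 4 ** word_length:
        raise ValueError(f"suffix {suffix} does not fit in {word_length} bases")
    word = []
    for _ in range(word_length):
        index, remainder = divmod(index, 4)
        word.append(NUCLEOTIDES[remainder])
    return "".join(reversed(word))


def encode_with_suffix(sequence, suffix, word_length=2):
    """Encode the barcode with its suffix word appended (2 extra bits per base)."""
    return encode_2bit(sequence + suffix_to_word(suffix, word_length))
```

With a word length of 2, every barcode grows by 4 bits, including those from datasets that only ever use "-1", which is the space cost described above.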

Do you have an idea of this maximal number of different integer suffixes in practice? In your opinion, does this situation (BAMs with multiple 10X libraries) occur frequently?

Note that a temporary (though not very neat or practical) solution would be to pre-process the BAM by replacing the -X suffixes with short nucleotide words specific to each library.
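A rough sketch of such a pre-processing step, working on SAM text lines so that it needs no third-party library. The function names and the two-base word length are hypothetical, and the sketch does not check whether LRez accepts barcodes of the resulting length.

```python
import re

NUCLEOTIDES = "ACGT"
# Matches an optional BX field of the form BX:Z:<nucleotides>-<integer>.
BX_FIELD = re.compile(r"BX:Z:([ACGT]+)-(\d+)")


def suffix_word(suffix, length=2):
    """Write the 1-based suffix as a fixed-length base-4 nucleotide word."""
    index = suffix - 1
    if not 0 <= index < 4 ** length:
        raise ValueError(f"suffix {suffix} does not fit in {length} bases")
    digits = []
    for _ in range(length):
        index, remainder = divmod(index, 4)
        digits.append(NUCLEOTIDES[remainder])
    return "".join(reversed(digits))


def rewrite_sam_line(line, length=2):
    """Replace the -X suffix of the BX tag with a library-specific word."""
    if line.startswith("@"):  # header lines pass through unchanged
        return line
    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    # Optional tags start at the 12th column of a SAM alignment line.
    for i in range(11, len(fields)):
        match = BX_FIELD.fullmatch(fields[i])
        if match:
            word = suffix_word(int(match.group(2)), length)
            fields[i] = "BX:Z:" + match.group(1) + word
    return "\t".join(fields) + newline
```

One way to apply it would be to stream the output of `samtools view -h` through a small script that calls `rewrite_sam_line` on each line and to convert the result back to BAM. Every barcode, including those with "-1", gets a word appended so that all barcodes keep the same length.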

Best, Claire

pontushojer commented 1 year ago

Thanks for the quick reply!

In the case of different suffix numbers in a single BAM, the expected behaviour of LRez would be to treat two barcodes that share the same nucleotide sequence but have different suffix numbers as two distinct barcodes, is that correct?

Yes, this is correct, as the same nucleotide barcode sequence could have been sampled in multiple library preparations.

This may not be straightforward to implement in LRez, since LRez assumes all barcodes are purely nucleotide sequences and encodes them into integers with a 2-bit encoding. The suffix numbers could be converted to nucleotide words appended to the barcodes, but this would cost extra space for the vast majority of datasets, which only have the "-1" suffix. To keep that extra space minimal, we would need to know in advance the maximal number of different integer suffixes for the given sample.

Do you have an idea of this maximal number of different integer suffixes in practice? In your opinion, does this situation (BAMs with multiple 10X libraries) occur frequently?

I expected this might be an issue. As for the maximal number of expected suffixes, I don't have a good answer. Clearly most people only use BAMs with one suffix ("-1"). I have merged as many as 6 different libraries myself, that is, 6 different suffixes in one BAM. I am however not sure how common this is for other people. A pretty safe estimate for the maximum number of integer suffixes would probably be around 10.
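To put a number on the space cost for that estimate, here is a small sketch that computes how many extra nucleotides (2 bits each) a suffix word would need. The function name is hypothetical and the calculation assumes the suffix is stored as a fixed-length base-4 word.

```python
def word_length_for(max_suffixes):
    """Smallest number of bases whose 4**length values cover all suffixes."""
    length = 0
    while 4 ** length < max_suffixes:
        length += 1
    return length


# Extra bits per barcode for the estimate of about 10 suffixes.
extra_bits = 2 * word_length_for(10)
```

Under these assumptions, both my 6 libraries and the estimate of 10 fit in two bases, that is 4 extra bits per barcode, and two bases would cover up to 16 suffixes.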

Note that a temporary (though not very neat or practical) solution would be to pre-process the BAM by replacing the -X suffixes with short nucleotide words specific to each library.

Yes, I suppose this would be a solution.

An even simpler and, for now, more practical solution would be to just ignore any barcode suffix when building the index. I think this would be an OK solution for now, as I am not sure this is a big problem for other users. One could also always confirm which suffix is present on an alignment after accessing it.
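Roughly what I have in mind, as a toy sketch rather than a proposal for the actual LRez index format: the index is keyed on the nucleotide sequence only, and the full BX value is checked after the records are fetched. The in-memory `records` list and all function names are hypothetical.

```python
from collections import defaultdict


def strip_suffix(bx_value):
    """Drop any "-<integer>" suffix, keeping only the nucleotide sequence."""
    return bx_value.split("-", 1)[0]


def build_index(records):
    """Map each barcode sequence to the positions of its records.

    `records` is a list of (bx_value, read) pairs standing in for alignments.
    """
    index = defaultdict(list)
    for position, (bx_value, _read) in enumerate(records):
        index[strip_suffix(bx_value)].append(position)
    return index


def query(records, index, bx_value):
    """Look up by sequence, then keep only reads whose full BX value matches."""
    positions = index.get(strip_suffix(bx_value), [])
    return [read for bx, read in (records[p] for p in positions) if bx == bx_value]
```

The index stays the same size as today, and the cost of telling libraries apart moves to query time, where only the records sharing that sequence need to be checked.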