pachterlab / kallisto

Near-optimal RNA-Seq quantification
https://pachterlab.github.io/kallisto
BSD 2-Clause "Simplified" License

kallisto correct : 'std::bad_alloc' #246

Open mdeloger opened 4 years ago

mdeloger commented 4 years ago

Hi,

I have an experiment with a whitelist composed of 35 barcodes (60bp long, non-standard technology in development), and when I try to run bustools correct I immediately get:

    (/bioinfo/local/build/Centos/envs_conda/kallisto_0.46.1) -bash-4.2$ bustools correct -w whitelist.txt -o output.sorted.correct.bus output.sort.bus
    Found 35 barcodes in the whitelist
    terminate called after throwing an instance of 'std::bad_alloc'
      what():  std::bad_alloc
    Aborted (core dumped)

I tried increasing the RAM to 200 GB, but that does not seem to solve the problem.

Could you help me understand what I am doing wrong, please?

Thank you in advance

pmelsted commented 4 years ago

There is an inherent 32bp limit on barcodes. This covers most technologies out there, since the barcode can be defined as only the variable parts, i.e. the parts spaced by fixed sequences. The whitelist.txt barcodes then have to match only the variable part of the original barcode sequence.

If you can send me the whitelist.txt file and the -x string you used, I can take a look.

mdeloger commented 4 years ago

Hi @pmelsted ,

Thank you for your answer :-) In fact, our barcode is 60bp long, composed of a random combination of 3 x 16bp-long sequences (each picked from 96 possibilities) separated by a 4bp fixed sequence, so at minimum we need to look at a 56bp region.
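The layout described above can be sketched in a few lines of Python to make the arithmetic explicit. This is only an illustration: it assumes the 56bp region starts at the first variable block and that the 4bp spacers sit between consecutive blocks, which the thread does not state. The function name `variable_blocks` is hypothetical.

```python
BLOCK = 16    # length of each variable block (from the description above)
LINKER = 4    # length of each fixed spacer (from the description above)
N_BLOCKS = 3  # number of variable blocks per barcode
CHOICES = 96  # possible sequences per block


def variable_blocks(region):
    """Return the three 16bp variable blocks of a 56bp region.

    Assumption: the region begins with the first block and the fixed
    4bp spacers sit between consecutive blocks.
    """
    step = BLOCK + LINKER
    return [region[i * step : i * step + BLOCK] for i in range(N_BLOCKS)]


# Length of the region that spans all three blocks and both spacers.
region_len = N_BLOCKS * BLOCK + (N_BLOCKS - 1) * LINKER

# Variable bases only, which is what the 32bp limit applies to.
variable_len = N_BLOCKS * BLOCK

# Size of the full whitelist if every combination is allowed.
n_combinations = CHOICES ** N_BLOCKS
```

Under these assumptions the variable part alone is 48bp, which is still longer than the 32bp limit mentioned earlier in the thread.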

I used -x 1,0,60:1,80,88:0,0,0.

Attached is a (short) example whitelist, as the full one contains more than 800,000 sequences.

Thank you very much.

Attachment: whitelist.txt

lakigigar commented 4 years ago

The number of barcode possibilities with barcodes of length 32 is 4^32 = 2^64, about 1.8*10^19. This is more than 10 million trillion... which we figured was enough... for now and for a while. Using 32 bases for barcodes allows us to represent each as a 64-bit number, which is convenient and reduces memory requirements when working with BUS. We could consider extensions that allow for your structure, but perhaps more practical is to select 32 positions from among your 48, and then work with those 32bp barcodes instead.
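To illustrate why 32 bases fit exactly into a 64-bit number, here is a minimal sketch of 2-bit-per-base packing. The mapping A=0, C=1, G=2, T=3 and the function names are assumptions chosen for illustration; this is not taken from the bustools source.

```python
# Illustrative 2-bit code per base (an assumption, not the bustools source).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"
MAX_LEN = 32  # 32 bases * 2 bits = 64 bits


def encode(barcode):
    """Pack a barcode of at most 32 bases into an integer below 2**64."""
    if len(barcode) > MAX_LEN:
        raise ValueError("barcode longer than 32 bases does not fit in 64 bits")
    value = 0
    for base in barcode:
        value = (value << 2) | CODE[base]
    return value


def decode(value, length):
    """Unpack an integer produced by encode() back into a barcode string."""
    return "".join(
        BASES[(value >> (2 * (length - 1 - i))) & 3] for i in range(length)
    )
```

With this packing, a 32-base barcode of all T uses every one of the 64 bits, so a 48bp or 60bp barcode cannot be stored in the same fixed-width field.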

mdeloger commented 4 years ago

Yes, you are right, sorry for that :-/

Actually, I do not know exactly the reason they have chosen this experimental design but I will ask them.

If it is in fact possible for you guys to make it work:

or

Thank you very much in advance for your help

pmelsted commented 4 years ago

The quick workaround is to "stare" at the 96 variable 16bp sites and pick 10bp parts from each that are at a long Hamming distance from each other. Anything with distance 3 or more should be just fine.
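The "staring" step can be automated with a small script. The sketch below is one possible approach, not a method prescribed in this thread: it assumes the same contiguous 10bp window is used for all 96 sequences of a block, tries every window start, and keeps the one whose trimmed sequences have the largest minimum pairwise Hamming distance. The function names are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(seqs):
    """Smallest Hamming distance between any two sequences in seqs."""
    return min(hamming(a, b) for a, b in combinations(seqs, 2))


def best_window(seqs, width=10):
    """Find the contiguous window that keeps the sequences furthest apart.

    Returns (start, distance): the 0-based window start and the minimum
    pairwise Hamming distance of the sequences trimmed to that window.
    Assumes all sequences have the same length and there are at least two.
    """
    length = len(seqs[0])
    best_start, best_dist = 0, -1
    for start in range(length - width + 1):
        trimmed = [s[start : start + width] for s in seqs]
        dist = min_pairwise_distance(trimmed)
        if dist > best_dist:
            best_start, best_dist = start, dist
    return best_start, best_dist
```

Running `best_window` once per block on its 96 sequences gives three 10bp windows, 30bp in total, which fits under the 32bp limit. If the reported distance is 3 or more for each block, the trimmed sequences meet the threshold suggested above. The whitelist would then be rebuilt from the trimmed sequences, and the -x string would have to select the same positions in the reads.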

mdeloger commented 4 years ago

Is there an «easy» way to do this workaround?