vals / umis

Tools for processing UMI RNA-tag data
MIT License

demultiplexing #2

roryk closed 8 years ago

roryk commented 8 years ago

Hi Valentine,

What do you think about adding an option to demultiplex the barcodes into separate files, named by the barcode? We could also pass along a file of allowed barcodes to match and filter out non-matching barcodes as we go. I don't want to muck up your repo with functionality you weren't intending though.
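A minimal sketch of what such a demultiplexing step could look like, not code from this repository. It assumes each FASTQ header already carries the cell barcode as a `CELL_<barcode>` token; that convention, the `demultiplex` name, and the `.fastq` file naming are illustrative assumptions.

```python
import os
import re

# Assumed header convention: the cell barcode appears as "CELL_<barcode>".
CELL_RE = re.compile(r"CELL_([ACGTN]+)")


def demultiplex(records, out_dir, allowed=None):
    """Write each 4-line FASTQ record to <out_dir>/<barcode>.fastq.

    `records` is an iterable of 4-line lists. If `allowed` is given,
    records whose barcode is not in it are dropped. Returns a dict of
    barcode -> number of records written.
    """
    os.makedirs(out_dir, exist_ok=True)
    handles = {}
    counts = {}
    try:
        for record in records:
            match = CELL_RE.search(record[0])
            if match is None:
                continue  # no barcode found in the header
            barcode = match.group(1)
            if allowed is not None and barcode not in allowed:
                continue  # exact-match filter against the allowed list
            if barcode not in handles:
                path = os.path.join(out_dir, barcode + ".fastq")
                handles[barcode] = open(path, "w")
            handles[barcode].writelines(record)
            counts[barcode] = counts.get(barcode, 0) + 1
    finally:
        for handle in handles.values():
            handle.close()
    return counts
```

This keeps one open file handle per barcode, so passing an allowed list also helps keep the number of open files bounded.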

vals commented 8 years ago

I figured it was quicker to just do the counting without demultiplexing if possible, since there are already tools for demultiplexing.

If you think it would be beneficial to have a demultiplexing subcommand, it wouldn't hurt to have it there.

It should be noted though that at the moment I'm handling demultiplexing by exact matching. With data I've handled, this "only" throws away 3% of the reads. Meanwhile, other demultiplexing tools do it in a way that allows some errors.
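To make the trade-off concrete, here is a rough sketch contrasting exact matching with matching that tolerates a few mismatches. It is not the code used in this repository or in any particular demultiplexing tool; `match_barcode` and `max_mismatches` are hypothetical names, and Hamming distance is just one assumed way of "allowing some errors".

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def match_barcode(observed, allowed, max_mismatches=0):
    """Return the allowed barcode that `observed` maps to, or None.

    With max_mismatches=0 this is plain exact matching. With a larger
    value, the read is rescued only if exactly one allowed barcode is
    within that Hamming distance; ambiguous reads are still dropped.
    """
    if observed in allowed:
        return observed
    if max_mismatches == 0:
        return None
    hits = [
        barcode
        for barcode in allowed
        if len(barcode) == len(observed)
        and hamming(observed, barcode) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None
```

The exact-match path is a single set lookup per read, while the tolerant path scans the whole allowed list for every read that misses, which is part of why exact matching is the cheaper default.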

The file of allowed barcodes is already implemented: this is the --cb_filter option. I use it a lot, e.g. for MARS-Seq and CEL-Seq data.

roryk commented 8 years ago

Thanks Valentine. What do you think about making the cb_filter option a subcommand, to decouple it from tagcount? Then you could do something like `umis fastqtransform foobar | umis cb_filter --barcode-list barcodes - | do streaming alignment` with a cleaned file or whatnot.
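As a rough illustration of the streaming idea, a standalone filter could read FASTQ records from a stream, keep only those whose barcode is in the allowed set, and pass them on unchanged. This is a sketch under assumptions, not the proposed subcommand itself: the `CELL_<barcode>` header convention and the function names are hypothetical.

```python
import re
from itertools import islice

# Assumed header convention: the cell barcode appears as "CELL_<barcode>".
CELL_RE = re.compile(r"CELL_([ACGTN]+)")


def fastq_records(lines):
    """Group an iterable of lines into 4-line FASTQ records."""
    lines = iter(lines)
    while True:
        record = list(islice(lines, 4))
        if len(record) < 4:
            return  # end of input (or a truncated trailing record)
        yield record


def filter_by_barcode(lines, allowed):
    """Yield only the records whose barcode exactly matches `allowed`."""
    for record in fastq_records(lines):
        match = CELL_RE.search(record[0])
        if match and match.group(1) in allowed:
            yield record


def run(barcode_path, in_handle, out_handle):
    """Read allowed barcodes (one per line), then filter a stream."""
    with open(barcode_path) as fh:
        allowed = {line.strip() for line in fh if line.strip()}
    for record in filter_by_barcode(in_handle, allowed):
        out_handle.writelines(record)
```

Calling `run("barcodes", sys.stdin, sys.stdout)` from a small script would let it sit in the middle of a pipe, so the downstream counting step never sees non-matching barcodes.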

vals commented 8 years ago

That's a great idea! This will avoid having to put a bunch of checks in every iteration of the tallying loop.