thelovelab / fishpond

Differential expression and allelic analysis, nonparametric statistics
https://thelovelab.github.io/fishpond

alevin-EC #17

Closed jgilis closed 2 years ago

jgilis commented 2 years ago

Hi Mike,

This PR adds alevin-EC, a function that generates an equivalence class (EC) count matrix from alevin-fry output, specifically from the raw bfh.txt files produced by running alevin with the optional --dumpBfh flag. Note that alevinEC.R is 30% faster and somewhat cleaner than the code I originally sent over email, mainly because it makes better use of the sparseMatrix structure.
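For readers following along, the idea of an EC-by-cell count matrix held in sparse triplet form can be sketched as below. This is a minimal illustration in Python, not the code in alevinEC.R: the function name `build_ec_matrix`, the record layout, and the EC labels are all hypothetical, and the records are assumed to be parsed already, so nothing here reflects the actual bfh.txt layout. Summing repeated (row, column) pairs mirrors what `Matrix::sparseMatrix` does in R when it is given duplicated `(i, j)` triplets.

```python
from collections import defaultdict


def build_ec_matrix(records):
    """Accumulate (ec_label, cell_barcode, count) records into sparse triplets.

    Returns (row_names, col_names, entries), where entries maps
    (row_index, col_index) -> summed count. Only nonzero cells are stored.
    """
    row_index = {}  # EC label -> row number, in first-seen order
    col_index = {}  # cell barcode -> column number, in first-seen order
    entries = defaultdict(int)
    for ec_label, barcode, count in records:
        i = row_index.setdefault(ec_label, len(row_index))
        j = col_index.setdefault(barcode, len(col_index))
        entries[(i, j)] += count  # repeated (EC, cell) pairs are summed
    return list(row_index), list(col_index), dict(entries)


# Hypothetical, already-parsed records; each EC is labelled by its transcript set.
records = [
    ("tx1|tx2", "cellA", 3),
    ("tx1", "cellA", 1),
    ("tx1|tx2", "cellB", 2),
    ("tx1|tx2", "cellA", 4),  # same EC and cell again
]
ecs, cells, counts = build_ec_matrix(records)
```

Storing only the observed (EC, cell) pairs keeps memory use proportional to the number of nonzero entries, which matters because most ECs are absent from most cells.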

Some notes, TODOs, points of discussion:

In addition, I am going to add the other function, which imports equivalence class counts from salmon output (--dumpEq), soon, but I will first check whether I can speed up the code further. So it will come in a separate PR, if you don't mind.

Best regards,

Jeroen

mikelove commented 2 years ago

Awesome!

Re: svMisc, I should change this throughout to an importFrom; I'll make this change this week.

Re: data.table, I might want to move this to Suggests, since this is the one place it's being used. Would you be comfortable with that? I can handle the code change this week. I figure most users will already have it installed anyway, but it's not strictly necessary for the functioning of the package as a whole.

Re: test data, I'm happy to add to tximportData. How large would we be talking?

jgilis commented 2 years ago

I agree with your comments/suggestions on svMisc and data.table.

With respect to the test data size:

Since all these data are publicly available, it would be possible to upload the quantifications to the scRNAseq package and make them available to all. Would be cool to have an SCE object with e.g. gene-level counts as main assay, but also including transcript-level and ECC-level counts, e.g. using the altExp. [edit: we could also omit gene-level counts and just aggregate the tx-level counts]. But happy to hear other suggestions!

Best regards,

Jeroen

mikelove commented 2 years ago

For tximportData, I can trim ~200 Mb from the alevin directory because 1) we have some legacy v0.12 testing data that can be removed, and 2) your new alevin output can replace the current testing data.

So if you can share 2 x 50 Mb output directories that would be great, and I will work on plugging that into tximportData and replacing testing code in fishpond and tximport. Specifically:

https://github.com/mikelove/tximport/blob/master/tests/testthat/test_alevin.R

https://github.com/mikelove/fishpond/blob/master/tests/testthat/test_readEDS.R

I can also take the Salmon output and add that into tximportData.

I like to avoid AnnotationHub in the unit tests if possible, because occasionally it's not available, but agree it would be nice to have the SCE in scRNAseq for others to use.

By the way, I'm discussing with Avi moving readEDS from fishpond to a new package called eds in the next devel cycle (I would transfer the GH repo to him first, as he is the creator). This is because readEDS is the only C++ code in fishpond, and transferring it would help make fishpond a simpler install.

https://github.com/mikelove/eds