Closed jgilis closed 2 years ago
Awesome!
re: svMisc, I should change this throughout to an importFrom, I'll make this change this week.
Re: data.table, I might want to move this to a Suggests, since this is the one place it's being used. Would you be comfortable with that? I can handle the code change this week. I figure most users will already have it installed anyway, but it's not strictly necessary for the entire functioning of the package.
Re: test data, I'm happy to add to tximportData. How large would we be talking?
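For reference, moving a dependency from Imports to Suggests usually means guarding its use at run time. The following is only a minimal sketch of that common idiom: `requireSuggested` is a hypothetical helper name, not existing fishpond code, and `data.table::fread` in the comment is just an illustrative call, not necessarily the one the new function uses.

```r
# Hypothetical helper (not part of fishpond): fail with a clear message
# when a package listed under Suggests is not installed.
requireSuggested <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop("Package '", pkg, "' is required for this function; ",
         "please install it.", call. = FALSE)
  }
  invisible(TRUE)
}

# Inside the function that needs the suggested package, one would then write
# something like (illustrative only):
#   requireSuggested("data.table")
#   dt <- data.table::fread(path)
```

With this pattern the package installs without data.table, and only users who call the one function that needs it are asked to install it.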
I agree with your comments/suggestions on svMisc and data.table.
With respect to the test data size:
- `bfh.txt` files are quite large; for dropseq I have 15 BFH files of ±50Mb each, for the 10X data it's 4 BFH files of ±500Mb each (more cells per file). The BFH files make up more than 90% of the object size of the salmon-alevin output folders. Since we would need at least two files, the smallest object I could provide would be 2*50Mb (without going inside the objects and removing stuff, that is).
- `eq_classes.txt` files.
Since all these data are publicly available, it would be possible to upload the quantifications to the `scRNAseq` package and make them available to all. Would be cool to have an SCE object with e.g. gene-level counts as main assay, but also including transcript-level and ECC-level counts, e.g. using the altExp. [edit: we could also omit gene-level counts and just aggregate the tx-level counts]. But happy to hear other suggestions!
Best regards,
Jeroen
For `tximportData`, I can trim ~200 Mb from the `alevin` directory because 1) we have some legacy testing data v0.12 that can be removed 2) your new alevin output can replace the current testing data.
So if you can share 2 x 50 Mb output directories that would be great, and I will work on plugging that into `tximportData` and replacing testing code in fishpond and tximport. Specifically:
https://github.com/mikelove/tximport/blob/master/tests/testthat/test_alevin.R
https://github.com/mikelove/fishpond/blob/master/tests/testthat/test_readEDS.R
I can also take the Salmon output and add that into tximportData.
I like to avoid AnnotationHub in the unit tests if possible, because occasionally it's not available, but agree it would be nice to have the SCE in scRNAseq for others to use.
By the way, I'm discussing with Avi to move `readEDS` from fishpond to a new package called `eds` in the next devel cycle (I would transfer GH repo to him first, as he is the creator). This is because `readEDS` is the only C++ code in fishpond and transferring it would help make fishpond a simpler install.
Hi Mike,
PR for including alevin-EC, a function to generate an equivalence class (EC) count matrix from alevin-fry output, more specifically from raw `bfh.txt` files that are generated by running alevin with the optional `--dumpBfh` flag. Note that `alevinEC.R` is 30% faster and somewhat cleaner than the code I originally sent over email (essentially by making better use of the `sparseMatrix` structure).
Some notes, TODOs, points of discussion:
- `alevinEC.R` does not have examples yet because there are no `bfh.txt` files available from the current packaged alevin output objects. So before I can have examples, we should discuss for which data we will generate `bfh.txt` files and in which package/hub these data should be included.
- `svMisc::progress`. I saw that in `swish.R` you call this function with `svMisc::progress` and not by importFrom svMisc progress. I did the same for my function, but not sure why we would do this.
- `svMisc::progress` in the console seems to overwrite itself + it overwrites the messages. I therefore commented out the two messages and the progress of the 2nd loop (which is much faster than the first anyway).
- `data.table` to imports, not suggests like you mentioned.
- `sparseMatrix` with barcode identifiers as column names and equivalence class identifiers as row names. The latter are still formatted as e.g. 111|112, which means the equivalence class is compatible with both the 111th and 112th transcript in `tx2gene.tsv` (1-indexed!). This is rather cryptic for an end-user.
In addition, I am going to include the other function to import equivalence class counts from salmon output (--dumpEq) soon, but I will first check if I can further speed up the code. So it will come in a different PR if you don't mind.
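To make the 111|112 row-name convention less cryptic, the identifiers could be unpacked into transcript names after import. The sketch below is only an illustration in base R: `decode_ec` is a hypothetical helper, not something in this PR, and `tx_names` is a toy stand-in for the transcript column of `tx2gene.tsv`, assumed to be in the same order that the 1-based indices refer to.

```r
# Hypothetical helper (not part of this PR): map EC identifiers such as
# "111|112" to the transcripts they are compatible with.
# Assumes each index is a 1-based position in `tx_names`.
decode_ec <- function(ec_ids, tx_names) {
  # Split each identifier on the literal "|" and convert the pieces to integers
  idx <- lapply(strsplit(ec_ids, "|", fixed = TRUE), as.integer)
  out <- lapply(idx, function(i) tx_names[i])
  names(out) <- ec_ids
  out
}

# Toy stand-in for the transcript column of tx2gene.tsv
tx_names <- paste0("tx", 1:200)

ec_tx <- decode_ec(c("111|112", "5"), tx_names)
ec_tx[["111|112"]]  # transcripts 111 and 112 of the toy vector
```

In practice `ec_ids` would be the row names of the returned matrix, and the decoded list could be kept alongside it as row-level metadata rather than replacing the compact identifiers.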
Best regards,
Jeroen