neurorestore / Libra

MIT License
145 stars 23 forks source link

min_feature option in run_de not working #29

Closed Peter-bbh closed 2 years ago

Peter-bbh commented 2 years ago

Hi,

As I understand it, the min_feature option can be used to avoid testing genes with none or few counts, thereby potentially increase the power of FDR.

However, there seems to be a problem with handling missing Cell types due to filtering.

Example:

data("hagai_toy") DE = run_de(hagai_toy, min_features = 1) [1] "bone marrow derived mononuclear phagocytes" DE = run_de(hagai_toy, min_features = 6) [1] "bone marrow derived mononuclear phagocytes" Error in group_by(): ! Must group by variables found in .data. ✖ Column cell_type is not found. Run rlang::last_error() to see where the error occurred.

jordansquair commented 2 years ago

min_features reflects the minimum number of replicates (or cells) expressing the gene. In this case there are only 6 replicates in the toy dataset so you are just removing all genes.

Peter-bbh commented 2 years ago

Thanks for the answer, but I'm still confused.

I thought min_feature represented the minimum total transcript counts per gene. The help section reads: "min_features the minimum number of counts for a gene to retain it. Defaults to 0". Is it possible to filter by the minimum total transcript counts per gene?

But even if min_feature is the minimum number of cells expression the gene, then most genes are expressed in 6 replica, see example below. Why do min_feature=5 work, but not min_feature=6?

"> to_pseudobulk(hagai_toy)[[1]][1:6,] [1] "bone marrow derived mononuclear phagocytes" mouse1:lps4 mouse2:lps4 mouse3:lps4 mouse1:unst mouse2:unst mouse3:unst 0610009B22Rik 14 26 27 36 41 42 0610009L18Rik 4 7 4 4 6 7 0610010F05Rik 9 25 19 17 20 17 0610010K14Rik 4 17 15 5 10 17 0610012G03Rik 62 132 105 126 237 232 0610030E20Rik 5 22 29 9 22 29

Peter-bbh commented 2 years ago

Just to follow-up.

I managed to followed the code and now think I understand what min_feature does: Count total number of cells with at least one count for the gene and then retain only those genes with more than min_feature.

However, I believe then the help section is missleading. At least for me, a minimum limit is inclusive, that is when setting min_feature=6 it would include those with 6 and not just 7 and above? In fact this interpretation is also what has been used for min_reps using ">=" versus ">" for min_feature in to_pseudo_bulk

Also, "counts" is ambigous in RNA seq, so this should be specified.

May I suggest: 1) to change the help text to ""min_features the minimum number of expressing cells for a gene to retain it. Defaults to 0" 2) change the code to include min_feature, that is change "keep_genes = rowSums(mat_mm > 0) > min_features" in to_pseudo_bulk to "keep_genes = rowSums(mat_mm > 0) >= min_features"?

Anyway, thanks for a very useful packages (and article) :)

jordansquair commented 2 years ago

Thanks. I've pushed an update.

Peter-bbh commented 2 years ago

Great, thanks again :)

hjh-air-cond commented 1 year ago

Sorry, I'm still confusing about that.

I try to set a DE workflow with Libra::run_de and which could filiter some genes that barely expressed(like the genes I marked): 图片

Those genes comes from Libra::run_de and flitered by logFC and pval, but only have like 120 out of 30,000 cell expressd. So I thought those genes might not be meaningful biologically and could affect the further analysis.

And I think the min_features parameter like the min.pct in Seurat::FindMarkers, but based on you guys' disscussion, which seems should be set to less than the total replicates when using pseudobulk methods.

Does that means like, when I use single-cell methods like MAST, min_features could be set to a big number not related to the replicates, like 1,000? Also, could there be a fuction which could help fliter gene by expression per cells under pseudobulk methods?

Thanks!

jordansquair commented 1 year ago

We will look to add something like this in an updated version. Although you can just subset the matrix yourself prior to run_de.