scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License
1.89k stars 594 forks source link

gex_only=True defaults in sc.read_10x_h5 #1949

Open LuckyMD opened 3 years ago

LuckyMD commented 3 years ago

This might be a bit of a rant, and I'm aware there are some good arguments for the way things are... but I just wasted 4 hours of my life because I wasn't aware of the default gex_only=True in sc.read_10x_h5().

Just wanted to flag that if scanpy support for multimodality becomes a thing, then this default may need to change to prevent frustration. I think moving read/write functionality to AnnData was already discussed at some point. For multimodal support, this might become more important again. If this is moved, then read_10x_h5 should probably not default to gex_only=True anymore.

Would it already be worth either making gex_only a required input? For backward compatability it could also just trigger a logging warning for now. I do think that this is quite an important thing to let people know with more and more 10X Multiome data being generated now (and CITE-seq for that matter).

chris-rands commented 3 years ago

I've ran into this before (sc.read_10x_mtx() has the same default of course). One possible issue with defaulting to gex_only=False is that someone might accidentally run a 'regular pipeline' with multi-modal data, e.g. log-normalizing RNA+protein+cell hashing counts together without first subsetting the adata based on .var["feature_types"]. By contrast, anyone who know they have multi-modal data would hopefully notice the missing the data with gex_only=True. Either way, logging warnings sounds good.

ivirshup commented 3 years ago

For context, the option was added in #334, and I think the scope for other feature types was much more limited at the time.

Would it already be worth either making gex_only a required input?

I'm not sure the gex_only argument even entirely makes sense anymore. I think a feature_type argument would make more sense. Erroring if nothing is passed and there are multiple kinds sounds reasonable to me, as multimodality should be handled explicitly.

For backwards compatibility I think deprecation warnings for a release cycle when either gex_only is used or nothing is passed and there are multiple feature types present could work.


Moving the 10x reading functions had been discussed in: #1387

LuckyMD commented 3 years ago

Thanks, @ivirshup... I 100% agree with your suggestions. Maybe it's already worth adding a warning to the read functions now in cases where feature_type has multiple values.

@chris-rands agreed that we shouldn't set the default to False. This will have multiple issues both for backward compatability and the cases you describe.

@DaneseAnna: this might be important for the new episcanpy function i proposed yesterday ;).