parashardhapola / scarf

Toolkit for highly memory efficient analysis of single-cell RNA-Seq, scATAC-Seq and CITE-Seq data. Analyze atlas scale datasets with millions of cells on laptop.
http://scarf.readthedocs.io
BSD 3-Clause "New" or "Revised" License
96 stars 12 forks source link

How do you reset the column "I" ? #63

Closed m0nib closed 3 years ago

m0nib commented 3 years ago

I loaded dataset as scarf.DataStore with the zarr directory then ran through the pipeline, i guess i made an error in the filter_cells function, now all the feats in column I are labelled False.

Therefore, when running mark_hvgs i keep getting an error:

ds.mark_hvgs( min_cells=20, top_n=500, min_mean=3, max_mean=2, max_var=6 )

Error: WARNING: WARNING: Number of valid features are less then value of parameter top_n: 500. Resetting top_n to 0


IndexError Traceback (most recent call last) /tmp/ipykernel_32217/1850153007.py in ----> 1 ds.mark_hvgs( 2 min_cells=20, 3 top_n=500, 4 min_mean=3, 5 max_mean=2,

~/anaconda3/envs/scarf_env/lib/python3.8/site-packages/scarf/datastore.py in mark_hvgs(self, from_assay, cell_key, min_cells, top_n, min_var, max_var, min_mean, max_mean, n_bins, lowess_frac, blacklist, show_plot, hvg_key_name, **plot_kwargs) 3565 f"of cells will be considered HVGs." 3566 ) -> 3567 assay.mark_hvgs( 3568 cell_key, 3569 min_cells,

~/anaconda3/envs/scarf_env/lib/python3.8/site-packages/scarf/assay.py in mark_hvgs(self, cell_key, min_cells, top_n, min_var, max_var, min_mean, max_mean, n_bins, lowess_frac, blacklist, hvg_key_name, show_plot, **plot_kwargs) 932 top_n = n_valid_feats - 1 933 min_var = ( --> 934 pd.Series(self.feats.fetch_all(col_renamer(c_var_col)))[idx] 935 .sort_values(ascending=False) 936 .values[top_n]

IndexError: index -1 is out of bounds for axis 0 with size 0

Could you advise on how to reset the column I or do i have to restart the piepline from re-creating the zarr files?

Thank you best regards Monib

parashardhapola commented 3 years ago

Hi,

Can you please tell me the output of the following commands:

ds.RNA.feats.fetch_all('I').sum()

and of this:

ds.cells.fetch_all('I').sum()

Also, if you print out ds in a notebook cell, what do you see?

parashardhapola commented 3 years ago

If you want to reset all cell level filtering, then this will do: ds.cells.reset_key('I')

If you want to reset all the feature level filtering, then you need this: ds.RNA.feats.reset_key('I')

m0nib commented 3 years ago

Hi,

Can you please tell me the output of the following commands:

ds.RNA.feats.fetch_all('I').sum()

and of this:

ds.cells.fetch_all('I').sum()

Also, if you print out ds in a notebook cell, what do you see?

Hi Parashardapola,

output from these commands looks like this::

ds.RNA.feats.fetch_all("I").sum()

56490

time: 9.83 ms (started: 2021-09-30 09:41:48 +02:00)

ds.cells.fetch_all('I').sum()

973667

time: 13.1 ms (started: 2021-09-30 09:42:03 +02:00)

ds

DataStore has 973667 (982538) cells with 1 assays: RNA Cell metadata: 'I', 'ids', 'names', 'RNA_nCounts', 'RNA_nFeatures', 'RNA_percentMito', 'RNA_percentRibo', 'annotation.l1', 'annotation.l2', 'assay', 'assay_ontology_term_id', 'cell_type', 'cell_type_ontology_term_id', 'dataset_origin', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'donor', 'ethnicity', 'ethnicity_ontology_term_id', 'health_status', 'original_annotation', 'predicted.annotation.l1.score', 'predicted.annotation.l2.score', 'prediction_score_l1_bin', 'prediction_score_l2_bin', 'sex', 'tissue', 'tissue_ontology_term_id',

RNA assay has 56490 (77521) features and following metadata: 'I', 'ids', 'names', 'dropOuts', 'nCells', 'stats_I_avg', 'stats_I_c_var2000.1', 'stats_I_normed_n', 'stats_I_normed_tot', 'stats_I_nz_mean', 'stats_I_sigmas'

time: 16.7 ms (started: 2021-09-30 09:42:12 +02:00)

I will reset the keys and try again, as it failed again.

Best regards Monib

m0nib commented 3 years ago

It seems like if you re-apply the filer_cells function cells and feats both retain the previous filter key in column "I" rather than first reset the keys and then apply the filter_cells.

m0nib commented 3 years ago

HI,

I reset the cells and feats keys and ran only

ds.filter_cells(attrs=['RNA_percentMito'], highs=[15], lows=[0])

followed by mark_hvgs using

ds.mark_hvgs( min_cells=20, top_n=500, min_mean=3, max_mean=2, max_var=6 )

I am still getting the same error:

(RNA) Computing nCells: 100%| 257650/257650 [03:56] (RNA) Computing normed_tot: 100%| 257650/257650 [04:05] (RNA) Computing sigmas: 100%| 257650/257650 [04:47]

WARNING: WARNING: Number of valid features are less then value of parameter top_n: 500. Resetting top_n to 0


IndexError Traceback (most recent call last) /tmp/ipykernel_40246/1850153007.py in ----> 1 ds.mark_hvgs( 2 min_cells=20, 3 top_n=500, 4 min_mean=3, 5 max_mean=2,

~/anaconda3/envs/scarf_env/lib/python3.8/site-packages/scarf/datastore.py in mark_hvgs(self, from_assay, cell_key, min_cells, top_n, min_var, max_var, min_mean, max_mean, n_bins, lowess_frac, blacklist, show_plot, hvg_key_name, **plot_kwargs) 3565 f"of cells will be considered HVGs." 3566 ) -> 3567 assay.mark_hvgs( 3568 cell_key, 3569 min_cells,

~/anaconda3/envs/scarf_env/lib/python3.8/site-packages/scarf/assay.py in mark_hvgs(self, cell_key, min_cells, top_n, min_var, max_var, min_mean, max_mean, n_bins, lowess_frac, blacklist, hvg_key_name, show_plot, **plot_kwargs) 932 top_n = n_valid_feats - 1 933 min_var = ( --> 934 pd.Series(self.feats.fetch_all(col_renamer(c_var_col)))[idx] 935 .sort_values(ascending=False) 936 .values[top_n]

IndexError: index -1 is out of bounds for axis 0 with size 0

Not sure what is happening here or what i am doing wrong.

Could you help with this.

Thank you

Best regards Monib

parashardhapola commented 3 years ago

Hi @m0nib,

That's a correct observation. The filter_cells function will not 'filter-in' cells that were previously 'filtered out'. This design decision was put in after a lot of thought. Sometimes the users perform a manual filtering step and then use filter_cells. In such cases, automatically resetting the filtering is not a good idea.

You can always force reset the filter using reset_key as I showed above.

Additionally the filter_cells function has a parameter called reset_previous. You can set this to True to automatically reset any previous filtering. If you do this, then you don't need to manually use the reset_key that I showed above.

parashardhapola commented 3 years ago

Hi @m0nib,

There is a problem with your parameters in mark_hvgs You are using a value for min_mean that is greater than max_mean (3>2). min_mean should be lower than max_mean. I would suggest that you don't use min_mean parameter for starters and look at the results. Maybe paste the plot that you get after successfully running mark_hvgs and then I can explain more.

m0nib commented 3 years ago

Hi,

It worked now.

Thank you Best regards Monib

parashardhapola commented 3 years ago

Great! I'll close this now