kfuku52 / amalgkit

RNA-seq data amalgamation for a large-scale evolutionary transcriptomics
BSD 3-Clause "New" or "Revised" License
7 stars 1 forks source link

pandas 1.5 #106

Closed kfuku52 closed 1 year ago

kfuku52 commented 1 year ago

With pandas 1.5 or newer, amalgkit prints numerous warnings like this:

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.loc[:,'is_sampled'] = 'No'
/Users/kf/miniconda3/lib/python3.9/site-packages/amalgkit/metadata.py:522: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

@Hego-CCTB Could you fix it? For this particular example, you can simply replace

df.loc[:,'is_sampled'] = 'No'

with

df['is_sampled'] = 'No'
Hego-CCTB commented 1 year ago

On it!

Hego-CCTB commented 1 year ago

Something unrelated I found in this part of the code is that we are inconsistent in the capitalization of Yes and No

df['bioproject'] = df['bioproject'].fillna('unknown').values
        df['is_sampled'] = 'No'
        df['is_qualified'] = 'No'
        df.loc[(df['exclusion'] == 'no'), 'is_qualified'] = 'Yes'
        while len(df.loc[(df.loc[:,'is_sampled']=='Yes')&(df.loc[:,'exclusion']=='no'),:]) < target_n:
            if len(df) <= target_n:
                df.loc[(df.loc[:,'exclusion']=='no'), 'is_sampled'] = 'Yes'
                break
            else:
                df_unselected = df.loc[(df.loc[:,'is_sampled']=='No')&(df.loc[:,'exclusion']=='no'),:]
                bioprojects = df_unselected.loc[:,'bioproject'].unique()

It's probably better to have either have everything either in lower case or capitalized.

kfuku52 commented 1 year ago

Shall we use lowercase?

Hego-CCTB commented 1 year ago

Sounds good, I'll make the change in the same update.

Hego-CCTB commented 1 year ago

Hm, on my end I get the SettingWithCopyWarning for both cases. From the debug console:

df.loc[:,'is_sampled'] = 'no'
<ipython-input-16-340eba43ef02>:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.loc[:,'is_sampled'] = 'no'

And

df['is_sampled'] = 'no'
<ipython-input-15-aadbd928c850>:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['is_sampled'] = 'no'

Pandas is at 1.5.1, though my python is still at 3.8. From the same debug console:

pandas.show_versions()
INSTALLED VERSIONS
------------------
python           : 3.8.2.final.0
pandas           : 1.5.1
kfuku52 commented 1 year ago

Hmm... what does it say with df['is_sampled'] = numpy.array(['no',] * df.shape[0])?

Hego-CCTB commented 1 year ago

No dice

df['is_sampled'] = numpy.array(['no',] * df.shape[0])
<ipython-input-24-a7a77dc7e8d0>:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Hego-CCTB commented 1 year ago

From what I can gather investigating the warning is that the returning object might be a copy so this may not change the actual df. What I don't understand is why that is happening. From an answer from Stack Overflow from last year:

We don't really know. pandas makes no guarantees about what returns from df.getitem(...), and under the hood, the memory layout of the stored array can result in a view or a copy. In general, when you work on a dataframe with a single dtype, you get a view but that's not guaranteed. I believe they are working on a new approach instead of using BlockManager which is the main source of these issues.

kfuku52 commented 1 year ago

OK, if there is no workaround, we could go with a dirty way for now like:

pandas.set_option('mode.chained_assignment', None)
df['is_sampled'] = 'no'
pandas.set_option('mode.chained_assignment', 'Warn')
Hego-CCTB commented 1 year ago

What's curious to me is, that working on self doesn't throw a warning:

self.df.loc[:, 'is_sampled'] = 'no'

That's of course not what we want, since the df used in this method is a subset of self.df. But for some reason, as soon as I create a subset of self within this method I will get the warning when trying to change values in that subset. When I create a completely independent dataframe and manually filling it with values I get no warning. It's very strange and I feel like I'm close to figuring it out, but I'm out of ideas.

I'll disable the warnings for this method for the time being, as you suggested.

Hego-CCTB commented 1 year ago

Pushed in the last commit: https://github.com/kfuku52/amalgkit/commit/9093f13eff626308b1656a408d582686da99547b

kfuku52 commented 1 year ago

Thanks!

kfuku52 commented 1 year ago

Your Yes/No change is not complete. See here for an example. https://github.com/kfuku52/amalgkit/blob/64b27c12477895eb173a2d3683d193372d1e94e3/amalgkit/integrate.py#L87-L88

When you want to change something, please make sure to change all relevant parts in the entire project. In this particular case, you should search for is_sampled and is_qualified to inspect where your change may influence.

Hego-CCTB commented 1 year ago

True, that's an oversight on my part. Found an instance in merge as well. Should be consistent now: https://github.com/kfuku52/amalgkit/commit/4fe2b606f3b95637c55ca4fadd3547e633c36adb