Closed kfuku52 closed 2 years ago
On it!
Something unrelated I found in this part of the code is that we are inconsistent in the capitalization of Yes
and No
df['bioproject'] = df['bioproject'].fillna('unknown').values
df['is_sampled'] = 'No'
df['is_qualified'] = 'No'
df.loc[(df['exclusion'] == 'no'), 'is_qualified'] = 'Yes'
while len(df.loc[(df.loc[:,'is_sampled']=='Yes')&(df.loc[:,'exclusion']=='no'),:]) < target_n:
if len(df) <= target_n:
df.loc[(df.loc[:,'exclusion']=='no'), 'is_sampled'] = 'Yes'
break
else:
df_unselected = df.loc[(df.loc[:,'is_sampled']=='No')&(df.loc[:,'exclusion']=='no'),:]
bioprojects = df_unselected.loc[:,'bioproject'].unique()
It's probably better to have either have everything either in lower case or capitalized.
Shall we use lowercase?
Sounds good, I'll make the change in the same update.
Hm, on my end I get the SettingWithCopyWarning
for both cases. From the debug console:
df.loc[:,'is_sampled'] = 'no'
<ipython-input-16-340eba43ef02>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df.loc[:,'is_sampled'] = 'no'
And
df['is_sampled'] = 'no'
<ipython-input-15-aadbd928c850>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df['is_sampled'] = 'no'
Pandas is at 1.5.1, though my python is still at 3.8. From the same debug console:
pandas.show_versions()
INSTALLED VERSIONS
------------------
python : 3.8.2.final.0
pandas : 1.5.1
Hmm... what does it say with df['is_sampled'] = numpy.array(['no',] * df.shape[0])
?
No dice
df['is_sampled'] = numpy.array(['no',] * df.shape[0])
<ipython-input-24-a7a77dc7e8d0>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
From what I can gather investigating the warning is that the returning object might be a copy so this may not change the actual df. What I don't understand is why that is happening. From an answer from Stack Overflow from last year:
We don't really know. pandas makes no guarantees about what returns from df.getitem(...), and under the hood, the memory layout of the stored array can result in a view or a copy. In general, when you work on a dataframe with a single dtype, you get a view but that's not guaranteed. I believe they are working on a new approach instead of using BlockManager which is the main source of these issues.
OK, if there is no workaround, we could go with a dirty way for now like:
pandas.set_option('mode.chained_assignment', None)
df['is_sampled'] = 'no'
pandas.set_option('mode.chained_assignment', 'Warn')
What's curious to me is, that working on self
doesn't throw a warning:
self.df.loc[:, 'is_sampled'] = 'no'
That's of course not what we want, since the df
used in this method is a subset of self.df
.
But for some reason, as soon as I create a subset of self
within this method I will get the warning when trying to change values in that subset.
When I create a completely independent dataframe and manually filling it with values I get no warning.
It's very strange and I feel like I'm close to figuring it out, but I'm out of ideas.
I'll disable the warnings for this method for the time being, as you suggested.
Pushed in the last commit: https://github.com/kfuku52/amalgkit/commit/9093f13eff626308b1656a408d582686da99547b
Thanks!
Your Yes/No change is not complete. See here for an example. https://github.com/kfuku52/amalgkit/blob/64b27c12477895eb173a2d3683d193372d1e94e3/amalgkit/integrate.py#L87-L88
When you want to change something, please make sure to change all relevant parts in the entire project. In this particular case, you should search for is_sampled
and is_qualified
to inspect where your change may influence.
True, that's an oversight on my part.
Found an instance in merge
as well. Should be consistent now:
https://github.com/kfuku52/amalgkit/commit/4fe2b606f3b95637c55ca4fadd3547e633c36adb
With pandas 1.5 or newer,
amalgkit
prints numerous warnings like this:@Hego-CCTB Could you fix it? For this particular example, you can simply replace
with