Closed: Hego-CCTB closed this issue 2 years ago
metadata.df = metadata.df.loc[numpy.where(is_sampled == True)[0],:]
seems to do the trick. Will make a couple of test runs to confirm and push the fix.
We should fix it in a better way. As the variable name implies, is_sampled should be a boolean list, and a boolean list is no problem to use in pd.DataFrame.loc. This should be done, e.g., by replacing...
is_sampled = numpy.array([])
for idx in metadata.df.index:
    is_sampled = numpy.append(is_sampled, strtobool(metadata.df['is_sampled'].loc[idx]))
with
is_sampled = numpy.array([strtobool(yn) for yn in df.loc[:,'is_sampled']],dtype=bool)
Does it work?
First try looks good!
This is a much better solution, as it keeps is_sampled sane.
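For reference, the boolean-mask approach discussed above can be sketched on a toy table. This is only an illustration, not the code from the fix: the DataFrame contents and the run names are made up, and yes_no_to_bool is a hypothetical stand-in for strtobool (which lived in distutils, a module that was removed in Python 3.12).

```python
import numpy as np
import pandas as pd


def yes_no_to_bool(value):
    """Hypothetical stand-in for strtobool: map yes/no-like strings to bool."""
    text = str(value).strip().lower()
    if text in {"yes", "y", "true", "1"}:
        return True
    if text in {"no", "n", "false", "0"}:
        return False
    raise ValueError(f"unexpected is_sampled value: {value!r}")


# Toy metadata table (made-up run names, not real SRA IDs).
df = pd.DataFrame({
    "run": ["RUN_A", "RUN_B", "RUN_C", "RUN_D"],
    "is_sampled": ["Yes", "yes", "No", "Yes"],
})

# Build the mask in one pass and force a boolean dtype.
is_sampled = np.array(
    [yes_no_to_bool(v) for v in df.loc[:, "is_sampled"]], dtype=bool
)

# A boolean array passed to .loc keeps exactly the rows marked True.
sampled = df.loc[is_sampled, :]
print(sampled)
```

Forcing dtype=bool is the important part: .loc then treats the array as a mask rather than as a list of index labels.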
Should be fixed in https://github.com/kfuku52/amalgkit/commit/52fc9d2397abd47cce61654f627399f6bf2cb7d0 (amalgkit version 0.6.5.4).
I encountered a bug while using getfastq --batch. It doesn't matter which number I put after --batch, it will always process the first entry from the metadata sheet (so getfastq --batch 1, getfastq --batch 30 and getfastq --batch 4035 all download the same SRA-ID). Since this problem occurs in load_metadata(), I assume other amalgkit functionalities are affected by this as well. I will push a fix for this shortly.
This is load_metadata() in utils.py
The source of this problem seems to be how is_sampled is constructed and used later on:
Suppose we have a metadata.tsv with 4 samples. All samples have Yes as a value in the 'is_sampled' column. In its current form, is_sampled will be a list [1,1,1,1].
This takes effect in this line:
This just duplicates the first entry of metadata.df 4 times and overwrites metadata.df with the duplicates. The correct is_sampled list should look like this in this hypothetical example: [1,2,3,4], or [1,2,4] if the 3rd entry had a 'no' in the is_sampled column.
EDIT: is_sampled is now a vector of boolean values, i.e. [True, True, False, True], and works as intended.
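To illustrate why a list of ones misbehaves while a boolean vector works, here is a small sketch on a made-up table. The run names and the default 0-based index are assumptions for the example, so the row that gets repeated here may differ from the one repeated in the real metadata sheet.

```python
import numpy as np
import pandas as pd

# Toy metadata table with a default 0..3 index (an assumption for this sketch).
df = pd.DataFrame({
    "run": ["RUN_A", "RUN_B", "RUN_C", "RUN_D"],
    "is_sampled": ["Yes", "Yes", "Yes", "Yes"],
})

# Numeric flags: .loc reads these as index LABELS, so the same row
# (the one labelled 1) is returned four times.
numeric_flags = np.array([1, 1, 1, 1])
as_labels = df.loc[numeric_flags, :]

# Boolean flags: .loc reads these as a MASK, so every row marked True
# is kept exactly once.
boolean_flags = numeric_flags.astype(bool)
as_mask = df.loc[boolean_flags, :]

print(as_labels["run"].tolist())
print(as_mask["run"].tolist())
```

The data are identical in both calls; only the dtype of the indexer decides whether .loc does a label lookup or a row filter.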