--batch and load_metadata() duplicate the 1st metadata.tsv entry

Hego-CCTB commented 2 years ago

I encountered a bug while using getfastq --batch. It doesn't matter which number i put after --batch, it will always process the first entry from the metadata sheet (so , getfastq --batch 1, getfastq --batch 30 and getfastq --batch 4035 all download the same SRA-ID). Since this problem occurs in load_metadata(), I assume other amalgkit functionalities are affected by this as well.

I will push a fix to this shortly.

This is load_metadata() in utils.py

def load_metadata(args):
    df = pandas.read_csv(args.metadata, sep='\t', header=0)
    metadata = Metadata.from_DataFrame(df)
    if 'batch' in dir(args):
        if args.batch is not None:
            print('--batch is specified. Processing one SRA per job.')
            is_sampled = numpy.array([])
            for idx in metadata.df.index:
                is_sampled = numpy.append(is_sampled, strtobool(metadata.df['is_sampled'].loc[idx]))
            txt = 'This is {:,}th job. In total, {:,} jobs will be necessary for this metadata table. {:,} SRAs were excluded.'
            print(txt.format(args.batch, sum(is_sampled), len(numpy.where(is_sampled == False)[0])))
            if args.batch>sum(is_sampled):
                sys.stderr.write('--batch {} is too large. Exiting.\n'.format(args.batch))
                sys.exit(0)
            if is_sampled.sum()==0:
                print('No sample is "sampled". Please check the "is_sampled" column in the metadata. Exiting.')
                sys.exit(1)
            metadata.df = metadata.df.loc[is_sampled,:]
            metadata.df = metadata.df.reset_index()
            metadata.df = metadata.df.loc[[args.batch-1,],:]
    return metadata

The source of this problem seems to be how is_sampled is constructed and used later on:

for idx in metadata.df.index:
                is_sampled = numpy.append(is_sampled, strtobool(metadata.df['is_sampled'].loc[idx]))

Suppose we have a metadtata.tsv with 4 samples. All samples have Yes as a value in the 'is_sampled' column. In it's current form, is_sampled will be a list [1,1,1,1]

This takes effect in this line:

metadata.df = metadata.df.loc[is_sampled,:]

# -> metadata.df = metadata.df.loc[[1,1,1,1],:]

This just duplicates the first entry of metadata.df 4 times and overwrites metadata.df with the duplicates. The correct is_sampled list should look like this in this hypothetical example: [1,2,3,4] and [1,2,4] if the 3rd entry would have a 'no' in the is_sampled column.

EDIT: is_sampled is now a vector of boolean values, i.e. [True, True, False, True] and works as intended

Hego-CCTB commented 2 years ago

metadata.df = metadata.df.loc[numpy.where(is_sampled == True)[0],:]

seems to do the trick. Will make a couple of test runs to confirm and push the fix.

kfuku52 commented 2 years ago

We should fix it in a better way. As the variable name implies, is_sampled should be a boolean list, and a boolean list is no problem to use in pd.DataFrame.loc. This should be done e.g., by replacing...

 is_sampled = numpy.array([])
            for idx in metadata.df.index:
                is_sampled = numpy.append(is_sampled, strtobool(metadata.df['is_sampled'].loc[idx]))

with

is_sampled = numpy.array([strtobool(yn) for yn in df.loc[:,'is_sampled']],dtype=bool)

Does it work?

Hego-CCTB commented 2 years ago

First try looks good!

This is a much better solution, as it keeps is_sampled sane.

Hego-CCTB commented 2 years ago

should be fixed in https://github.com/kfuku52/amalgkit/commit/52fc9d2397abd47cce61654f627399f6bf2cb7d0 amalgkit version 0.6.5.4

kfuku52 / amalgkit

--batch and load_metadata() duplicate the 1st metadata.tsv entry #90