kfuku52 closed this issue 2 years ago
ah, this is remnant code. I'll give it a proper --batch function and remove this part. Will fix this today.
Sounds good, thank you! Just to make sure, --batch is for species-wise processing. When --batch is None, amalgkit should iterate the curation of all species one by one in a single curate command.
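A minimal sketch of that dispatch logic, assuming a pandas metadata table with a scientific_name column (the function and variable names here are illustrative, not amalgkit's actual internals):

```python
import pandas as pd

def curate_all_or_one(df, batch=None):
    # batch is None: iterate the curation of every species in one command.
    # batch is a 1-based species index: curate only that species.
    spp = df['scientific_name'].drop_duplicates().tolist()
    targets = spp if batch is None else [spp[batch - 1]]
    done = []
    for sp in targets:
        sub = df.loc[df['scientific_name'] == sp, :]
        # ... the per-species curation step would run on `sub` here ...
        done.append((sp, len(sub)))
    return done
```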
yes!
I will also implement this assuming amalgkit merge has been run before, so all the quantified data is in one place. The curate input dir (--infile_dir) should be the merge output dir or the cstmm output dir. I just found there are options to specify files (--infile and --eff_len_file) rather than directories. This is weird; we only need input directory paths.
That's true, we don't need infile/eff_len_file anymore. These were originally used in case someone used custom filenames for these (this is code from when merge was not really implemented yet and curate didn't use the metadata sheet). I'll change these to the directory path.
I won't be able to use load_metadata() from util.py for this, since it only retains one line in --batch mode. I will write a new function for this.
My idea would be:
--batch in curate looks really strange now... if args.batch is None: and if args.batch is not None: look switched. Could you check and fix it if necessary?
Your plan looks good, but there should be no reason to avoid load_metadata().
it only retains one line in --batch mode.

What does it mean? curate should be a species-wise process (with --batch), not an SRA-wise process like getfastq or quant.
What does it mean? curate should be a species-wise process (with --batch), not an SRA-wise process like getfastq or quant.
Yes, that's what I meant. We use load_metadata() for all the other subfunctions, and there it only extracts a single line from the metadata to process (if --batch is given). For curate we need to change that behaviour to extract all lines from a certain species. I first thought about modifying the existing load_metadata(), but I think it's gonna be easier to just write a new function.
I still don't get what you mean, but wouldn't simple processing like this work after load_metadata()?
spp = metadata.df.loc[:,'scientific_name'].drop_duplicates().values
sp = spp[args.batch-1]
is_sp = (metadata.df.loc[:,'scientific_name']==sp)
metadata.df = metadata.df.loc[is_sp,:].drop_duplicates().sort_index()
Yes, something like that was the plan!
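For reference, that selection runs as expected on a toy metadata table. Note that sort_index() stands in for a bare sort_values() call, which would raise a TypeError on a DataFrame without a by argument; the species and run values below are made up for the demo:

```python
import pandas as pd

# Toy metadata mimicking amalgkit's 'scientific_name' column (values made up)
df = pd.DataFrame({
    'scientific_name': ['Arabidopsis thaliana', 'Oryza sativa',
                        'Arabidopsis thaliana', 'Oryza sativa'],
    'run': ['SRR1', 'SRR2', 'SRR3', 'SRR4'],
})

batch = 2  # 1-based species index, as in args.batch
spp = df.loc[:, 'scientific_name'].drop_duplicates().values
sp = spp[batch - 1]
is_sp = (df.loc[:, 'scientific_name'] == sp)
df = df.loc[is_sp, :].drop_duplicates().sort_index()
print(df['run'].tolist())  # only the Oryza sativa rows remain
```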
Update:
- --batch should now work as intended with curate.
- infile and eff_len_file were replaced by a single input_dir.
- input_dir accepts merge or cstmm output directories; if none is given, the merge directory is assumed.

https://github.com/kfuku52/amalgkit/commit/b397f1a9b9661d9cedf53a0bc201acc44a73df85 and 2b87dc92ce3e4a76af6e836d5f7c1036dd0c75e0
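The default-directory behaviour described above could look like this sketch (the out_dir layout and subdirectory names are assumptions, not amalgkit's actual code):

```python
import os

def resolve_input_dir(out_dir, input_dir=None):
    # If --input_dir is given (merge or cstmm output), use it; otherwise
    # assume the merge output directory under out_dir, per the update above.
    if input_dir is not None:
        return input_dir
    return os.path.join(out_dir, 'merge')
```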
Thanks. --norm can take strings beyond just tpm or fpkm, such as log2p1-tpm, so this does not capture all tpm-based transformations:

if args.norm == 'tpm':

Could you fix it?
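One way to capture all tpm-based values would be a suffix check rather than exact equality (a sketch; the committed fix may differ):

```python
def is_tpm_based(norm):
    # 'tpm' and transformed variants such as 'log2p1-tpm' end with 'tpm',
    # while 'fpkm' and 'log2p1-fpkm' do not.
    return norm is not None and norm.endswith('tpm')

print(is_tpm_based('log2p1-tpm'))  # True
print(is_tpm_based('fpkm'))        # False
```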
sure thing!
a5d04cb9ad9df176957a18920c3046739443a336
Done!
Looks good, thanks!
@Hego-CCTB I was surprised to find that curate still uses sbatch: https://github.com/kfuku52/amalgkit/blob/641223f3f18c0ae953a03323a329d924b2fad8af/amalgkit/curate.py#L153-L214

Please find our one-year-old discussion here: https://github.com/kfuku52/amalgkit/issues/20. In my opinion, we shouldn't introduce a dependency on SLURM, because it makes amalgkit useless in most computing environments. If you have a different opinion, please let me know. If you agree, please get rid of the dependency. If you don't have time, I will.
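One SLURM-free alternative would be to invoke the curate R script directly via subprocess instead of wrapping it in sbatch (a sketch under stated assumptions; the script path, argument list, and helper names are hypothetical):

```python
import shutil
import subprocess

def build_curate_command(r_script, r_args):
    # Plain Rscript invocation, no sbatch wrapper, so curate can run in
    # any environment where R is installed. (Hypothetical helper.)
    return ['Rscript', r_script] + [str(a) for a in r_args]

def run_curate_script(r_script, r_args):
    if shutil.which('Rscript') is None:
        raise RuntimeError('Rscript not found in PATH; R is required for curate.')
    return subprocess.run(build_curate_command(r_script, r_args), check=True)
```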