kfuku52 / amalgkit

RNA-seq data amalgamation for a large-scale evolutionary transcriptomics
BSD 3-Clause "New" or "Revised" License
7 stars 1 forks source link

SLURM dependency #101

Closed kfuku52 closed 2 years ago

kfuku52 commented 2 years ago

@Hego-CCTB I was surprised to find that curate still uses sbatch. https://github.com/kfuku52/amalgkit/blob/641223f3f18c0ae953a03323a329d924b2fad8af/amalgkit/curate.py#L153-L214

Please find our one-year-old discussion here https://github.com/kfuku52/amalgkit/issues/20 . In my opinion, we shouldn't introduce a dependency to SLURM, just because it makes amalgkit useless in most computing environments. If you have a different opinion, please let me know. If you agree, please get rid of the dependency. If you don't have time, I will.

Hego-CCTB commented 2 years ago

ah, this is remnant code. I'll give it a proper --batch function and remove this part. Will fix this today.

kfuku52 commented 2 years ago

Sounds good, thank you! Just to make sure, --batch is for species-wise processing. When --batch is None, amalgkit should iterate the curation of all species one by one in a single curate command.

Hego-CCTB commented 2 years ago

yes! I will also implement this assuming amalgkit merge has been run before, so all the quantified data is in one place.

kfuku52 commented 2 years ago

curate input dir (--infile_dir) should be merge output dir or cstmm output dir. I just found there are options to specify files (--infile and --eff_len_file) rather than directories. This is weird, we only need input directory paths.

Hego-CCTB commented 2 years ago

That's true, we don't need infile/eff_len_file anymore. These were originally used in case someone used custom filenames for these (this is code from when merge was not really implemented yet and curate didn't use the metadata sheet). I'll change these to the directory path.

I won't be able to use load_metadata() from util.py for this, since it only retains one line in --batch mode. I will write a new function for this.

My idea would be:

kfuku52 commented 2 years ago

curate looks really strange now... if args.batch is None: and if args.batch is not None: look switched. Could you check and fix it if necessary?

Your plan looks good, but there should be no reason to avoid load_metadata().

it only retains one line in --batch mode.

What does it mean? curate should be a species-wise process (with --batch), not an SRA-wise process like getfastq or quant.

Hego-CCTB commented 2 years ago

What does it mean? curate should be a species-wise process (with --batch), not an SRA-wise process like getfastq or quant.

Yes, that's what I meant. We use load_metadata() for all the other subfunctions, for which the function only extracts a single line from the metadata to process (if --batch is given). For curate we need to change that behaviour to extract all lines from a certain species. I first thought about modifying the existing load_metadata(), but I think it's gonna be easier to just write a new function.

kfuku52 commented 2 years ago

I don't get what you mean still, but wouldn't simple processing like this work after load_metadata()?

spp = metadata.df.loc[:,'scientific_name'].drop_duplicates().values
sp = spp[args.batch-1]
is_sp = (metadata.df.loc[:,'scientific_name']==sp)
metadata.df = metadata.df.loc[is_sp,:].drop_duplicates().sort_values()
Hego-CCTB commented 2 years ago

Yes, something like that was the plan!

Hego-CCTB commented 2 years ago

Update:

https://github.com/kfuku52/amalgkit/commit/b397f1a9b9661d9cedf53a0bc201acc44a73df85 and 2b87dc92ce3e4a76af6e836d5f7c1036dd0c75e0

kfuku52 commented 2 years ago

Thanks. --norm can take a string more than just tpm or fpkm, such as log2p1-tpm, so this does not capture all tpm-based transformations.

if args.norm == 'tpm':

Could you fix it?

Hego-CCTB commented 2 years ago

sure thing!

Hego-CCTB commented 2 years ago

a5d04cb9ad9df176957a18920c3046739443a336

Done!

kfuku52 commented 2 years ago

Looks good, thanks!