Closed kfuku52 closed 1 year ago
This looks like a problem I've had before as well. Check this file: /lustre7/home/lustre4/kfuku/my_project/evolutionary_transcriptomics/20230527_gfe_pipeline/gfe_data/curate/metadata.tsv. This metadata sheet is not overwritten by default and can cause problems. The species is not found in the sheet because the metadata.tsv in the curate folder is from a previous run or a different species.
The relevant discussion is here https://github.com/kfuku52/amalgkit/issues/103#issuecomment-1281923599
Could you remind me why curate/metadata.tsv should be preserved by default and why curate/metadata.tsv should be species level rather than the whole table?
> Could you remind me why curate/metadata.tsv should be preserved by default
I think this is a relic from when curate/metadata.tsv was implemented. When I noticed it was being protected and made the comment I've linked above, I just preserved the behaviour.
> why curate/metadata.tsv should be species level rather than the whole table?
I am wondering the same. It should be the whole table. This may be a --batch issue, because when I run curate without --batch, it does process the whole table.
Ah, this goes back to load_metadata() in utils.py.
When --batch is active, it will load only the metadata for that species. That's the same metadata from which curate/metadata.tsv will be produced.
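The species-wise loading described above can be sketched roughly as follows. This is not amalgkit's actual `load_metadata()`: the helper name `load_species_rows`, the `scientific_name` column, and the idea that the batch value is a 1-based index into the sorted species list are all assumptions made for illustration.

```python
import csv
import io


def load_species_rows(tsv_text, batch=None):
    """Return metadata rows; with a batch index, only the rows of one species.

    Hypothetical helper: the `scientific_name` column and the 1-based
    batch index over sorted species names are assumptions.
    """
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    if batch is None:
        return rows  # no batch index: keep the whole table
    species = sorted({row["scientific_name"] for row in rows})
    if not 1 <= batch <= len(species):
        raise ValueError(f"batch must be between 1 and {len(species)}")
    target = species[batch - 1]
    return [row for row in rows if row["scientific_name"] == target]


example = (
    "run\tscientific_name\n"
    "SRR1\tHomo sapiens\n"
    "SRR2\tArabidopsis thaliana\n"
    "SRR3\tHomo sapiens\n"
)
whole_table = load_species_rows(example)
one_species = load_species_rows(example, batch=2)
print(len(whole_table), [row["run"] for row in one_species])
```

Any table written from the filtered rows contains a single species, which would explain why a file produced under --batch is species level.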
I think we should preserve the species-wise processing with --batch. But I would change it so instead of curate/metadata.tsv, we create curate/metadata_Genus_species.tsv. That way we ensure that --batch doesn't cause conflicts when multiple instances of curate try to create the same file at the same time.
It could also be combined with this issue:
https://github.com/kfuku52/amalgkit/issues/124
In that case, we would create curate/Genus_species/metadata.tsv instead, which gets passed to the Rscript.
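To illustrate why a per-species path avoids the collision, here is a minimal sketch. The function name `species_metadata_path` and the rule of replacing spaces with underscores are assumptions, not amalgkit code.

```python
from pathlib import Path


def species_metadata_path(curate_dir, scientific_name):
    """Build a per-species path such as curate/Genus_species/metadata.tsv.

    Hypothetical helper: the space-to-underscore rule is an assumption.
    """
    species_dir = scientific_name.strip().replace(" ", "_")
    return Path(curate_dir) / species_dir / "metadata.tsv"


# Two parallel batch jobs for different species get different output files.
paths = [
    species_metadata_path("curate", name)
    for name in ("Homo sapiens", "Mus musculus")
]
print([path.as_posix() for path in paths])
```

Because each batch job writes only inside its own species directory, no two jobs for different species ever target the same file.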
OK, currently, curate/metadata.tsv is just a copy of merge/metadata.tsv, so I will remove it. curate/Genus_species/tables/Genus_species.sra.tsv is the species-wise metadata.tsv, which contains new info generated by curate. To be consistent in the file naming, I will change it to curate/Genus_species/tables/Genus_species.metadata.tsv, and we can use this file to add curate-related info in the future. It would be a good idea to concatenate the species-wise files into curate/metadata.tsv when all species are completed, but I will not implement such functionality now. Maybe in the future.
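For reference, the concatenation mentioned above could look something like the sketch below. It is not implemented in amalgkit. The function name `concat_species_metadata` is hypothetical, and the code assumes the curate/Genus_species/tables/Genus_species.metadata.tsv layout described in this comment, with tab-separated files that each have a header row.

```python
import csv
from pathlib import Path


def concat_species_metadata(curate_dir, out_name="metadata.tsv"):
    """Merge curate/*/tables/*.metadata.tsv into curate/<out_name>.

    Hypothetical helper: columns are the union of all headers in
    first-seen order, and missing values are written as empty strings.
    """
    curate_dir = Path(curate_dir)
    rows, columns = [], []
    for path in sorted(curate_dir.glob("*/tables/*.metadata.tsv")):
        with path.open(newline="") as handle:
            reader = csv.DictReader(handle, delimiter="\t")
            for name in reader.fieldnames or []:
                if name not in columns:
                    columns.append(name)
            rows.extend(reader)  # remaining lines after the header
    out_path = curate_dir / out_name
    with out_path.open("w", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=columns, delimiter="\t", restval=""
        )
        writer.writeheader()
        writer.writerows(rows)
    return out_path
```

Running it only after every species has finished would avoid the concurrent-write problem, since a single process writes the combined file.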