kfuku52 / amalgkit

RNA-seq data amalgamation for a large-scale evolutionary transcriptomics
BSD 3-Clause "New" or "Revised" License
7 stars 1 forks source link

Obsoleting updated_metadata #103

Closed kfuku52 closed 1 year ago

kfuku52 commented 1 year ago

Mapping rate should be obtainable from a kallisto output file, so updated_metadata is no longer necessary as we discussed a long time ago. @Hego-CCTB Could you work on that? It would be great if you could apply this update in 2 weeks or so. If you need more time, or if you don't have time in the coming weeks, please let me know.

Hego-CCTB commented 1 year ago

I'm on it!

Hego-CCTB commented 1 year ago

Turns out we already implemented that the mapping rate is being obtained from a .json file from the kallisto output. We just never removed the calls to the updated_metadata functions, so the extra files were still being produced. I confirmed this by commenting all those calls out and running the complete pipeline from start to finish on a handful of test samples and amalgkit worked just fine.

During curate there is still a metadata.tsv file being created in the curate folder, which contains the mapping rates and is used during the execution of the R script.

I would just completely remove any calls to the updated_metadata related functions. Shall I remove the functions as well?

kfuku52 commented 1 year ago

Thank you. Could you remove the functions as well? Also, the mapping rates are summarized only in curate and it may not be convenient because the final step of some use cases may be just merge or cstmm, not reaching curate. Would it be possible to make merge produces merge/metadata.tsv that contains the mapping rates just like curate/metadata.tsv?

Hego-CCTB commented 1 year ago

Any point after quant should be quite alright. Should I leave get_mapping_rate() still as a part of curate (as a backup), or should I move this to merge completely?

kfuku52 commented 1 year ago

get_mapping_rate() should still be there in curate because we want to use curate/metadata.tsv without merging it with merge/metadata.tsv if the project reaches curate.

Hego-CCTB commented 1 year ago

The update was very straight forward. I was hoping to push the SVA alternatives update as well, but I need to work out some kinks still. For now, here's the update: https://github.com/kfuku52/amalgkit/commit/b8d2944ac1f051caf5ee5ce8fc83c89927e4faf0

Something I've noticed while testing is the following problem with this: When re-using the same project folder, but different input data, curate will cause problems, because it specifically never overwrites any metadata.tsv. So it will try to use an old metadata.tsv, even though the expression data may not line up with its contents. The current solution is to either make a new project folder every time, or manually delete the metadata.tsv from the curate folder.

I see two quick solutions to this:

kfuku52 commented 1 year ago

I prefer the --redo option. I have managed similar problems by date before, but problems are more likely to occur when incorporating such date-based output files into a pipeline.

Hego-CCTB commented 1 year ago

Added --overwrite_metadata or -o in the last push. I decided not to use --redo, since --redo usually refers to all files involved in a rerun. https://github.com/kfuku52/amalgkit/commit/5104adf34fee3e26395633e62dfdd8e2f7625332

kfuku52 commented 1 year ago

Thanks. Please close the issue if everything is done.

kfuku52 commented 1 year ago

amalgkit merge fails due to a missing argument (--overwrite_metadata) when merge/metadata.tsv exists. I added --overwrite_metadata to merge in the last commit ( https://github.com/kfuku52/amalgkit/commit/d138af4d353c4f5838eef539edd3ecf5ca7b0a73 ), but is there any reason not to overwrite merge/metadata.tsv despite merge/GENUS_SPECIES is newly generated every time? Also, the option name --overwrite_metadata is confusing because it implies the update of the file specified by --metadata, but not.

Traceback (most recent call last):
  File "/Users/kf/Dropbox/repos/amalgkit/amalgkit/amalgkit", line 381, in <module>
    args.handler(args)
  File "/Users/kf/Dropbox/repos/amalgkit/amalgkit/amalgkit", line 77, in command_merge
    merge_main(args)
  File "/Users/kf/Dropbox/repos/amalgkit/amalgkit/merge.py", line 52, in merge_main
    write_updated_metadata(metadata, os.path.join(args.out_dir, 'merge', 'metadata.tsv'), args)
  File "/Users/kf/Dropbox/repos/amalgkit/amalgkit/curate.py", line 21, in write_updated_metadata
    if not args.overwrite_metadata:
AttributeError: 'Namespace' object has no attribute 'overwrite_metadata'
kfuku52 commented 1 year ago

I believe this part is no longer necessary. Please remove it if correct. @Hego-CCTB I am still waiting for your response to the above problem.

https://github.com/kfuku52/amalgkit/blob/e220d1de4bb6fd42459312d9ccb086345f758216/amalgkit/sanity.py#L69-L81

Hego-CCTB commented 1 year ago

amalgkit merge fails due to a missing argument (--overwrite_metadata) when merge/metadata.tsv exists. I added --overwrite_metadata to merge in the last commit ( d138af4 ), but is there any reason not to overwrite merge/metadata.tsv despite merge/GENUS_SPECIES is newly generated every time? Also, the option name --overwrite_metadata is confusing because it implies the update of the file specified by --metadata, but not.

Traceback (most recent call last):
  File "/Users/kf/Dropbox/repos/amalgkit/amalgkit/amalgkit", line 381, in <module>
    args.handler(args)
  File "/Users/kf/Dropbox/repos/amalgkit/amalgkit/amalgkit", line 77, in command_merge
    merge_main(args)
  File "/Users/kf/Dropbox/repos/amalgkit/amalgkit/merge.py", line 52, in merge_main
    write_updated_metadata(metadata, os.path.join(args.out_dir, 'merge', 'metadata.tsv'), args)
  File "/Users/kf/Dropbox/repos/amalgkit/amalgkit/curate.py", line 21, in write_updated_metadata
    if not args.overwrite_metadata:
AttributeError: 'Namespace' object has no attribute 'overwrite_metadata'

This should absolutely not happen as this is an argument exclusively used in curate. Must have slipped in when I was simplifying the subparser. This will be removed.

Hego-CCTB commented 1 year ago

I believe this part is no longer necessary. Please remove it if correct. @Hego-CCTB I am still waiting for your response to the above problem.

https://github.com/kfuku52/amalgkit/blob/e220d1de4bb6fd42459312d9ccb086345f758216/amalgkit/sanity.py#L69-L81

Yup, we don't need this anymore!

Hego-CCTB commented 1 year ago

ab1ff5d174515f8468cb0d268047847a0ab60230 and https://github.com/kfuku52/amalgkit/commit/418632b70ad4e6e7d0e37cfb6cd38a878bb7eca7