kfuku52 / amalgkit

RNA-seq data amalgamation for a large-scale evolutionary transcriptomics
BSD 3-Clause "New" or "Revised" License

Error in data.frame(curate_group = curate_group_u, curate_group_color = curate_group_color[1:length(curate_group_u)], #129

Closed: kfuku52 closed this issue 1 year ago

kfuku52 commented 1 year ago

stderr

Warning message:
In dir.create(dir_pdf) :
  '/lustre7/home/lustre4/kfuku/my_project/evolutionary_transcriptomics/20230527_gfe_pipeline/gfe_data//curate/plots' already exists
Error in data.frame(curate_group = curate_group_u, curate_group_color = curate_group_color[1:length(curate_group_u)],  : 
  arguments imply differing number of rows: 0, 1
Calls: save_plot -> add_color_to_sra -> data.frame
Execution halted

stdout

Started at Wed May 31 18:09:01 JST 2023
AMALGKIT version: 0.9.11
AMALGKIT command: /home/kfuku/miniconda3/bin/amalgkit curate --out_dir /lustre7/home/lustre4/kfuku/my_project/evolutionary_transcriptomics/20230527_gfe_pipeline/gfe_data --batch 1 --input_dir /lustre7/home/lustre4/kfuku/my_project/evolutionary_transcriptomics/20230527_gfe_pipeline/gfe_data/merge --overwrite_intermediate_metadata yes
AMALGKIT bug report: https://github.com/kfuku52/amalgkit/issues
amalgkit curate: start
2023-05-31 18:09:32.993153: Loading metadata from: /lustre7/home/lustre4/kfuku/my_project/evolutionary_transcriptomics/20230527_gfe_pipeline/gfe_data/metadata/metadata.tsv
transcriptome_curation.r: mode = batch 
 [1] "/lustre7/home/lustre4/kfuku/my_project/evolutionary_transcriptomics/20230527_gfe_pipeline/gfe_data/merge/Actinidia_chinensis/Actinidia_chinensis_est_counts.tsv"
 [2] "/lustre7/home/lustre4/kfuku/my_project/evolutionary_transcriptomics/20230527_gfe_pipeline/gfe_data/curate/metadata.tsv"                                         
 [3] "/lustre7/home/lustre4/kfuku/my_project/evolutionary_transcriptomics/20230527_gfe_pipeline/gfe_data"                                                             
 [4] "/lustre7/home/lustre4/kfuku/my_project/evolutionary_transcriptomics/20230527_gfe_pipeline/gfe_data/merge/Actinidia_chinensis/Actinidia_chinensis_eff_length.tsv"
 [5] "pearson"                                                                                                                                                        
 [6] "0.2"                                                                                                                                                            
 [7] "0"                                                                                                                                                              
 [8] "0"                                                                                                                                                              
 [9] "flower|leaf|root"                                                                                                                                               
[10] "log2p1-fpkm"                                                                                                                                                    
[11] "0"                                                                                                                                                              
[12] "0.3"                                                                                                                                                            
transcriptome_curation.r: dir_work = /lustre7/home/lustre4/kfuku/my_project/evolutionary_transcriptomics/20230527_gfe_pipeline/gfe_data/ 
Number of SRA runs for this species: 0 
Number of SRA runs for selected tissues: 47 
Number of non-excluded SRA runs (exclusion=="no"): 0 
Applying FPKM transformation.
Applying log_2(x+1) normalization.
removing entries with mapping rate of 0. 
Mapping rate cutoff: 0%
No entry removed due to low mapping rate.
Entering --batch mode for amalgkit curate. processing 1 species
This is 1th job. In total, 93 jobs will be necessary for this metadata table.
processing species number  1  :  Actinidia chinensis
Found a total number of  1  species in this metadata table:
____________________________
Actinidia chinensis
____________________________
Intermediate metadata file was not detected. Preparing...
quant directory found: /lustre7/home/lustre4/kfuku/my_project/evolutionary_transcriptomics/20230527_gfe_pipeline/gfe_data/quant
Number of quant sub-directories that matched to metadata: 37
Writing curate metadata containing mapping rate: /lustre7/home/lustre4/kfuku/my_project/evolutionary_transcriptomics/20230527_gfe_pipeline/gfe_data/curate/metadata.tsv
Tissues to be included: flower, leaf, root
Input_directory: /lustre7/home/lustre4/kfuku/my_project/evolutionary_transcriptomics/20230527_gfe_pipeline/gfe_data/merge
Both counts and effective length files found.
Starting Rscript to obtain curated log2p1-fpkm values.
Time elapsed: 68 sec
amalgkit curate: end
Ended at Wed May 31 18:10:10 JST 2023
Hego-CCTB commented 1 year ago

This looks like a problem I've had before as well. Check this file: /lustre7/home/lustre4/kfuku/my_project/evolutionary_transcriptomics/20230527_gfe_pipeline/gfe_data/curate/metadata.tsv. This metadata sheet is not overwritten by default, so a stale copy can cause problems. curate doesn't find the species in the sheet because the metadata.tsv in the curate folder is left over from a previous run or a different species.

The relevant discussion is here https://github.com/kfuku52/amalgkit/issues/103#issuecomment-1281923599
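A quick way to confirm this is to count the rows for the species in the sheet. Below is a minimal sketch using only the Python standard library. The column name `scientific_name` and the sample row are assumptions for illustration, not taken from the actual file.

```python
import csv
import io


def count_species_rows(tsv_text, species, column="scientific_name"):
    """Count rows whose `column` equals `species` in tab-separated text.

    The default column name is an assumption about the metadata layout.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return sum(1 for row in reader if row.get(column) == species)


# Hypothetical miniature of a stale curate/metadata.tsv left by another species.
stale = (
    "run\tscientific_name\tcurate_group\n"
    "SRR0000001\tArabidopsis thaliana\tleaf\n"
)
print(count_species_rows(stale, "Actinidia chinensis"))  # prints 0
```

For a real file, read its contents first (for example with `open(path).read()`) and pass the text in. A count of 0 would match the "Number of SRA runs for this species: 0" line in the stdout above.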

kfuku52 commented 1 year ago

Could you remind me why curate/metadata.tsv should be preserved by default and why curate/metadata.tsv should be species level rather than the whole table?

Hego-CCTB commented 1 year ago

> Could you remind me why curate/metadata.tsv should be preserved by default

I think this is a relic from when curate/metadata.tsv was implemented. When I noticed it's being protected and made the comment I've linked above, I just preserved the behaviour.

> why curate/metadata.tsv should be species level rather than the whole table?

I am wondering the same. It should be the whole table. This may be a --batch issue, because when I run curate without --batch, it does process the whole table.

Hego-CCTB commented 1 year ago

Ah, this goes back to load_metadata() in utils.py.

When --batch is active, it will load only the metadata for that species. That's the same metadata from which curate/metadata.tsv will be produced.
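To illustrate the behaviour described here, this is a rough sketch of species-wise subsetting, not the actual code of load_metadata(). The function name `rows_for_batch`, the `scientific_name` column, the 1-based batch index, and the alphabetical species ordering are all assumptions.

```python
def rows_for_batch(rows, batch, column="scientific_name"):
    """Return (species, rows) for the `batch`-th species (1-based).

    Hypothetical helper: the real ordering of species in amalgkit may differ.
    """
    species = sorted({row[column] for row in rows})
    if not 1 <= batch <= len(species):
        raise ValueError(f"batch must be between 1 and {len(species)}")
    target = species[batch - 1]
    # Only the rows of the selected species survive, so any file written
    # from this subset is species-level rather than the whole table.
    return target, [row for row in rows if row[column] == target]
```

Because the subset is what gets written out, a curate/metadata.tsv produced in one batch job would contain a single species, which is consistent with the mismatch seen in this issue.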

Hego-CCTB commented 1 year ago

I think we should preserve the species-wise processing with --batch. But instead of curate/metadata.tsv, I would create curate/metadata_Genus_species.tsv. That way, --batch cannot cause conflicts when multiple instances of curate try to create the same file at the same time. This could also be combined with https://github.com/kfuku52/amalgkit/issues/124, in which case we would create curate/Genus_species/metadata.tsv instead and pass that file to the Rscript.
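As a sketch of the path scheme proposed here, the helper below derives a per-species file location so that parallel jobs never write to the same path. The function name `species_metadata_path` is hypothetical, and replacing spaces with underscores is an assumption about how the Genus_species directory name is built.

```python
from pathlib import PurePosixPath


def species_metadata_path(out_dir, scientific_name):
    """Build <out_dir>/curate/Genus_species/metadata.tsv for one species."""
    # Assumed naming rule: spaces in the scientific name become underscores.
    species_dir = scientific_name.strip().replace(" ", "_")
    return PurePosixPath(out_dir) / "curate" / species_dir / "metadata.tsv"


print(species_metadata_path("gfe_data", "Actinidia chinensis"))
# prints gfe_data/curate/Actinidia_chinensis/metadata.tsv
```

Since each species maps to its own directory, two --batch jobs for different species cannot collide on the same file.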

kfuku52 commented 1 year ago

OK. Currently, curate/metadata.tsv is just a copy of merge/metadata.tsv, so I will make it obsolete. curate/Genus_species/tables/Genus_species.sra.tsv is the species-wise metadata table, which contains the new info generated by curate. To keep the file naming consistent, I will rename it to curate/Genus_species/tables/Genus_species.metadata.tsv, and we can use this file to add curate-related info in the future. It would be a good idea to concatenate the species-wise files into curate/metadata.tsv once all species are completed, but I will not implement that functionality now. Maybe in the future.
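For reference, the concatenation step mentioned above could look roughly like the following. This is only a sketch of one possible approach, not planned amalgkit code. The function name `concat_metadata` is hypothetical, and it assumes the species-wise tables are tab-separated with a header row.

```python
import csv
import io


def concat_metadata(tsv_texts):
    """Concatenate TSV texts, keeping the union of columns in first-seen order."""
    rows, columns = [], []
    for text in tsv_texts:
        reader = csv.DictReader(io.StringIO(text), delimiter="\t")
        for name in reader.fieldnames or []:
            if name not in columns:
                columns.append(name)
        rows.extend(reader)
    out = io.StringIO()
    # restval="" fills columns that a given species-wise table lacks.
    writer = csv.DictWriter(
        out, fieldnames=columns, delimiter="\t", restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

In practice the input texts could be collected with something like `pathlib.Path(out_dir).glob("curate/*/tables/*.metadata.tsv")` once every species has finished, and the result written to curate/metadata.tsv.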