kfuku52 / amalgkit

RNA-seq data amalgamation for a large-scale evolutionary transcriptomics
BSD 3-Clause "New" or "Revised" License
7 stars 1 forks source link

Error in h(simpleError(msg, call)) #61

Closed kfuku52 closed 3 years ago

kfuku52 commented 3 years ago

This error occurred in curate with a very large Apis dataset, but not in its subset. I'll investigate more next week.

/Users/kef74yk/anaconda3/envs/intellij/bin/python /Users/kef74yk/Dropbox_p/repos/amalgkit/amalgkit/amalgkit curate --infile ./merge/Apis_mellifera/Apis_mellifera_est_counts.tsv --eff_len_file ./merge/Apis_mellifera/Apis_mellifera_eff_length.tsv --metadata ./metadata/metadata/metadata_03_curated_20210623.tsv --norm fpkm
amalgkit curate: start
Updated metadata file was not detected. Preparing...
Number of quant sub-directories that matched to metadata: 809
Writing updated metadata: /Users/kef74yk/Dropbox (Personal)/collaborators/Daniel_Friedman/20210618_amalgkit/amalgkit_out/curate/metadata.tsv
Tissues to be included: adipose_W, brain_M, brain_Q, brain_W, hypopharyngeal_glands_W, antennae_W, malpighian_tubule_W, mandibular_gland_W, midgut_W, nasonov_gland_W, second_thoracic_ganglia_W, skeletal_muscle_W, sting_gland_W, ovary_W, mushroom_bodies_M, mushroom_bodies_W, larval_gut_W, adipose_Q, mandibular_gland_Q, head_and_thorax_Q, embryo_M, embryo_W
Single species mode
Both counts and effective length provided.
Calculating:  fpkm
transcriptome_curation.r: mode = batch 
 [1] "/Users/kef74yk/Dropbox (Personal)/collaborators/Daniel_Friedman/20210618_amalgkit/amalgkit_out/merge/Apis_mellifera/Apis_mellifera_est_counts.tsv"                                                                                                                                                                       
 [2] "/Users/kef74yk/Dropbox (Personal)/collaborators/Daniel_Friedman/20210618_amalgkit/amalgkit_out/curate/metadata.tsv"                                                                                                                                                                                                      
 [3] "/Users/kef74yk/Dropbox (Personal)/collaborators/Daniel_Friedman/20210618_amalgkit/amalgkit_out"                                                                                                                                                                                                                          
 [4] "/Users/kef74yk/Dropbox (Personal)/collaborators/Daniel_Friedman/20210618_amalgkit/amalgkit_out/merge/Apis_mellifera/Apis_mellifera_eff_length.tsv"                                                                                                                                                                       
 [5] "pearson"                                                                                                                                                                                                                                                                                                                 
 [6] "0.2"                                                                                                                                                                                                                                                                                                                     
 [7] "0"                                                                                                                                                                                                                                                                                                                       
 [8] "0"                                                                                                                                                                                                                                                                                                                       
 [9] "adipose_W|brain_M|brain_Q|brain_W|hypopharyngeal_glands_W|antennae_W|malpighian_tubule_W|mandibular_gland_W|midgut_W|nasonov_gland_W|second_thoracic_ganglia_W|skeletal_muscle_W|sting_gland_W|ovary_W|mushroom_bodies_M|mushroom_bodies_W|larval_gut_W|adipose_Q|mandibular_gland_Q|head_and_thorax_Q|embryo_M|embryo_W"
[10] "fpkm"                                                                                                                                                                                                                                                                                                                    
[11] "0"                                                                                                                                                                                                                                                                                                                       
transcriptome_curation.r: dir_work = /Users/kef74yk/Dropbox (Personal)/collaborators/Daniel_Friedman/20210618_amalgkit/amalgkit_out/ 
Number of SRA runs for this species: 925 
Number of SRA runs for selected tisssues: 925 
Number of non-excluded SRA runs (exclusion=="no"): 809 
Warning messages:
1: In xtfrm.data.frame(x) : cannot xtfrm data frames
2: In xtfrm.data.frame(x) : cannot xtfrm data frames
3: In xtfrm.data.frame(x) : cannot xtfrm data frames
initial  value 25.124216 
final  value 25.124216 
converged
Warning messages:
1: In xtfrm.data.frame(x) : cannot xtfrm data frames
2: In xtfrm.data.frame(x) : cannot xtfrm data frames
3: In xtfrm.data.frame(x) : cannot xtfrm data frames
4: In cor(tc, method = dist_method) : the standard deviation is zero
Number of significant surrogate variables is:  19 
Iteration (out of 10 ):1  2  3  4  5  6  7  8  9  10  SVA correction was correctly performed.
Warning messages:
1: In xtfrm.data.frame(x) : cannot xtfrm data frames
2: In xtfrm.data.frame(x) : cannot xtfrm data frames
3: In xtfrm.data.frame(x) : cannot xtfrm data frames
initial  value 30.577108 
final  value 30.577108 
converged
Warning messages:
1: In xtfrm.data.frame(x) : cannot xtfrm data frames
2: In xtfrm.data.frame(x) : cannot xtfrm data frames
3: In xtfrm.data.frame(x) : cannot xtfrm data frames
4: In xtfrm.data.frame(x) : cannot xtfrm data frames
5: In xtfrm.data.frame(x) : cannot xtfrm data frames
6: In xtfrm.data.frame(x) : cannot xtfrm data frames
Mapping rate cutoff: 20%
Removed due to low mapping rate:
SRR6880250: mapping rate = 13.8%
SRR6880251: mapping rate = 4%
SRR6880254: mapping rate = 13.2%
SRR6880255: mapping rate = 10.4%
SRR6880256: mapping rate = 10.5%
SRR6880258: mapping rate = 2.4%
SRR6880259: mapping rate = 2.9%
SRR6880260: mapping rate = 3.7%
SRR6880266: mapping rate = 12%
SRR6880267: mapping rate = 7.5%
SRR6880268: mapping rate = 5.8%
SRR6880270: mapping rate = 3.9%
SRR6880271: mapping rate = 6.9%
SRR6880272: mapping rate = 10.2%
SRR6880273: mapping rate = 6.1%
SRR6880274: mapping rate = 4%
SRR6880275: mapping rate = 5.1%
SRR6880283: mapping rate = 13.5%
SRR6880280: mapping rate = 14.3%
SRR6880282: mapping rate = 19.9%
SRR6880293: mapping rate = 5.6%
SRR6880296: mapping rate = 7.2%
SRR11447327: mapping rate = 19.4%
SRR11447326: mapping rate = 19.3%
SRR11447325: mapping rate = 12.1%
SRR9007915: mapping rate = 10.7%
SRR9007914: mapping rate = 12.1%
SRR1036397: mapping rate = 0%
SRR1036399: mapping rate = 0%
SRR1036398: mapping rate = 0%
SRR1613229: mapping rate = 1.1%
initial  value 26.877097 
iter   5 value 15.018159
iter  10 value 13.409272
iter  15 value 13.195291
iter  15 value 13.183281
iter  15 value 13.181568
final  value 13.181568 
converged
Warning messages:
1: In xtfrm.data.frame(x) : cannot xtfrm data frames
2: In xtfrm.data.frame(x) : cannot xtfrm data frames
3: In xtfrm.data.frame(x) : cannot xtfrm data frames
Number of significant surrogate variables is:  20 
Iteration (out of 10 ):1  2  3  4  5  6  7  8  9  10  SVA correction was correctly performed.
Warning messages:
1: In xtfrm.data.frame(x) : cannot xtfrm data frames
2: In xtfrm.data.frame(x) : cannot xtfrm data frames
3: In xtfrm.data.frame(x) : cannot xtfrm data frames
initial  value 29.668834 
iter   5 value 16.097654
iter  10 value 15.089423
final  value 14.927551 
converged
iteratively checking within-tissue correlation, round: 2 
Warning messages:
1: In xtfrm.data.frame(x) : cannot xtfrm data frames
2: In xtfrm.data.frame(x) : cannot xtfrm data frames
3: In xtfrm.data.frame(x) : cannot xtfrm data frames
4: In xtfrm.data.frame(x) : cannot xtfrm data frames
5: In xtfrm.data.frame(x) : cannot xtfrm data frames
6: In xtfrm.data.frame(x) : cannot xtfrm data frames
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'unique': undefined columns selected
Calls: check_within_tissue_correlation -> unique -> [ -> [.data.frame
Execution halted
Time elapsed: 1,638 sec
amalgkit curate: end

Process finished with exit code 0
Hego-CCTB commented 3 years ago

Hm. I introduced two calls of unique() to check for cases in which all samples of a tissue belong to the same bioproject. This call runs on the metadata table. I suspect whatever causes those may cause problems in the metadata, which results in the error.

kfuku52 commented 3 years ago

Could you indicate where the unique calls you think suspicious are?

kfuku52 commented 3 years ago

OK, maybe I found it. The if condition you originally introduced looks like this: https://github.com/kfuku52/amalgkit/blob/cbd6852060319083283ca9f062a106709c97e63d/amalgkit/transcriptome_curation.r#L275

I don't know why unique() is required here, because sra2_run_other_bp is a data.frame, not a vector. But more importantly, length() returns the number of columns, not rows, for a data.frame, so it always returns a fixed number no matter how many SRA runs remain there. Could you tell me how you tested this code? If I can improve your coding environment, I'm happy to help.

Hego-CCTB commented 3 years ago

I used browser() to enter debug and see which part of the if condition the function enters when running through my dataset. I used my Helianthus dataset, where all flower samples came from the same bioproject. I set my_tissue to flower and ran it through browser, looked in which part of the if statement I end up and did the same with leaf.

For my testset this works, because sra2_run_other_bp[sra2_run_other_bp$tissue == my_tissue] returns a dataframe with 0 columns, when my_tissue == flower. This was certainly not how I inteded this to work, but I didn't notice, because it actually did what it was supposed to do.

Hego-CCTB commented 3 years ago

Oh, and I just realized why it returns a dataset with 0 columns. It's because sra2_run_other_bp$tissue == my_tissue returns enough FALSE, so that sra2_run_other_bp[sra2_run_other_bp$tissue == my_tissue] ends up empty.

kfuku52 commented 3 years ago

In this particular case, I'd suggest first checking what's stored in sra2_run_other_bp$tissue == my_tissue and then sra2_run_other_bp[sra2_run_other_bp$tissue == my_tissue] (the error should be detected here) before passing them to other functions like length(unique(sra2_run_other_bp[sra2_run_other_bp$tissue == my_tissue])).

And of course, we need a good test dataset. https://github.com/kfuku52/amalgkit/issues/41

kfuku52 commented 3 years ago

fixed in https://github.com/kfuku52/amalgkit/commit/cdd71a0a19d6dfcccd2624104d02af28a576dd24