kfuku52 / amalgkit

RNA-seq data amalgamation for a large-scale evolutionary transcriptomics
BSD 3-Clause "New" or "Revised" License
7 stars 1 forks source link

Warning message: "cor(tc, method = dist_method) : the standard deviation is zero" #84

Closed docxology closed 2 years ago

docxology commented 2 years ago

When running amalgkit, in the terminal output during the transcriptome_curation.r step, I see "cor(tc, method = dist_method) : the standard deviation is zero".

Not sure what kind of Warning message this is, since amalgkit seems to have run successfully, just wanted to flag it though. Thank you.

transcriptome_curation.r: dir_work = /home/osboxes/Documents/20210618_amalgkit/amalgkit_out/with_brain_M/ 
Number of SRA runs for this species: 925 
Number of SRA runs for selected tisssues: 925 
Number of non-excluded SRA runs (exclusion=="no"): 809 
initial  value 27.312586 
iter   5 value 13.278421
iter  10 value 9.374159
iter  15 value 4.603239
iter  20 value 3.350343
iter  25 value 2.566446
iter  30 value 1.810457
iter  35 value 1.322577
iter  40 value 1.179548
iter  45 value 1.120166
iter  50 value 1.085901
iter  55 value 1.070313
final  value 1.066690 
converged
Warning message:
In cor(tc, method = dist_method) : the standard deviation is zero
Number of significant surrogate variables is:  19 
Iteration (out of 10 ):1  2  3  4  5  6  7  8  9  10  SVA correction was correctly performed.
initial  value 27.639213 
final  value 27.639213 
converged
Hego-CCTB commented 2 years ago

This error warning usually occurs when a row (expression values of a single gene in all samples) has the exact same value for each sample, so SD is 0. Happens often in cases of genes that are simply not expressed. Genes without expression should be filtered out before applying SVA and then added back to the dataframe afterwards, so this warning message should not be caused by 0 expression, which is weird.

What's the input for this set? I usually provide count and length file automatically and let amalgkit curate do the conversion into TPM/FPKM.

Can you identify any rows in the expression table with non-zero values, but equal in every column?

docxology commented 2 years ago

"What's the input for this set?"

"Can you identify any rows in the expression table with non-zero values, but equal in every column?" For the Species_cstmm_counts.tsv, there are 47 rows with stdev.p = 0, and all these rows have all 0 expression. Then all the rest of the target_id have non-zero stdev.p (so they do not have equal expression across all SRA), however some calculated values are very small (e.g. 5.3E-10 is smallest non-zero stdev.p).

"What's the input for this set? " -- What other information or file can I provide.

Thank you for the help ~

Hego-CCTB commented 2 years ago

For the Species_cstmm_counts.tsv, there are 47 rows with stdev.p = 0, and all these rows have all 0 expression. Then all the rest of the target_id have non-zero stdev.p (so they do not have equal expression across all SRA), however some calculated values are very small (e.g. 5.3E-10 is smallest non-zero stdev.p).

Is this also true for species.uncorrected.tc.tsv ?

Hego-CCTB commented 2 years ago

The amalgkit command you used may be helpful too

docxology commented 2 years ago

For species.uncorrected.tc.tsv , it is the same pattern (47 rows with 0 stdev.p and also all 0 expression, then all other rows with non-zero values (smallest value here through is E-15, whereas it was E-10 for Species_cstmm_counts.tsv).

The amalgkit command used was:

sci_name="Apis_mellifera"
file_metadata='/home/osboxes/Documents/20210618_amalgkit/amalgkit_out/metadata/metadata/metadata_03_curated_20210623_no_brain_M.tsv'
dir_out='/home/osboxes/Documents/20210618_amalgkit/amalgkit_out/no_brain_M'
dir_count='/home/osboxes/Documents/20210618_amalgkit/amalgkit_out/merge/'

python3.9 amalgkit merge --out_dir ${dir_out} --metadata ${file_metadata}

python3.9 amalgkit cstmm --out_dir ${dir_out} --count ${dir_count} --ortho ./hoge

python3.9 amalgkit curate \
--out_dir ${dir_out} \
--infile ${dir_out}/cstmm/${sci_name}/${sci_name}_cstmm_counts.tsv \
--eff_len_file ${dir_out}/cstmm/${sci_name}/${sci_name}_eff_length.tsv \
--metadata ${file_metadata} \
--norm fpkm
Hego-CCTB commented 2 years ago

Does this warning occur in just one iteration, or in multiple? In rows, where cor() has issues with an SD of 0, it will produce NAs, which are visible in the heatmap. You'll find a number of PDF files in out_dir/curate/plots/ named in the pattern species.iteration-number.correlation_cutoff.sva.pdf.

Can you send me the correlation_cutoff.sva.pdf of the steps, where this message occured?

docxology commented 2 years ago

Only one iteration.

Sent you an email with the PDF.

Hego-CCTB commented 2 years ago

@C20H25N30 , I found out what causes this.

Short version: This can safely be ignored and is not connected to issue https://github.com/kfuku52/amalgkit/issues/85

Long version: This warning message only happens during round 0. In this round, samples with low mapping rate have not yet been removed. In the dataset you sent me, there are 3 samples with a mapping rate of 0. This means, that all expression values for those 3 samples are 0, which means a standard deviation of 0 for those samples. Pearson correlation needs an SD > 0, because it will divide by the SD along the way. If it's 0, it will cause NAs.

This doesn't cause any issues down the line, because these samples will be removed between round 0 and round 1 after mapping rate filtering.

kfuku52 commented 2 years ago

Did you make a change to suppress this warning in round 0?

Hego-CCTB commented 2 years ago

No, but it should be easily possible to exclude samples of a mapping rate of 0 before going into round 0. I'll add a quick update.

kfuku52 commented 2 years ago

Sounds good, thanks!

Hego-CCTB commented 2 years ago

Added a 0% mapping rate check before round 0 amalgkit ver. 0.5.2 https://github.com/kfuku52/amalgkit/commit/5eeafaaf3a72e9b2c87647394d78e5c3ac323dea

kfuku52 commented 2 years ago

cat() should take a string ending with '\n' but it looks OK otherwise.

Hego-CCTB commented 2 years ago

yeah, just noticed that myself. fixed. https://github.com/kfuku52/amalgkit/commit/d870b99d19a27a0f441961978ac678ca7f359bb7