Closed: docxology closed this issue 2 years ago.
This warning usually occurs when a row (the expression values of a single gene across all samples) has exactly the same value in every sample, so its SD is 0. This often happens with genes that are simply not expressed.
Genes without expression should be filtered out before SVA is applied and added back to the data frame afterwards, so this warning should not be caused by zero expression, which is odd.
What's the input for this set? I usually provide the count and length files automatically and let amalgkit curate do the conversion into TPM/FPKM.
Can you identify any rows in the expression table with non-zero values, but equal in every column?
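One way to run the check asked for above is sketched below. It is only an illustration, not part of amalgkit: it assumes a tab-separated table with a header line, the gene ID (target_id) in the first column, and one numeric column per sample. The function name and the toy values are made up.

```python
import csv
import io
from statistics import pstdev


def constant_rows(tsv_text):
    """Return (all_zero_ids, constant_nonzero_ids) for a gene-by-sample TSV."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    next(reader)  # skip header: target_id followed by one column per sample
    all_zero, constant_nonzero = [], []
    for row in reader:
        if not row:
            continue
        gene_id = row[0]
        values = [float(v) for v in row[1:]]
        # Population SD, the same quantity a spreadsheet reports as stdev.p
        if pstdev(values) == 0:
            if values[0] == 0:
                all_zero.append(gene_id)
            else:
                constant_nonzero.append(gene_id)
    return all_zero, constant_nonzero


# Toy table with made-up IDs and values (assumption, not real data)
example = (
    "target_id\tSRR_A\tSRR_B\tSRR_C\n"
    "gene1\t0\t0\t0\n"
    "gene2\t5.0\t5.0\t5.0\n"
    "gene3\t1.0\t2.5\t0.3\n"
)
all_zero, constant_nonzero = constant_rows(example)
print("all-zero rows:", all_zero)
print("constant non-zero rows:", constant_nonzero)
```

For a real file, the text could be read with open(path).read() and passed to constant_rows(); any ID in the second list would be a gene that is expressed but identical in every sample.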
"What's the input for this set?"
"Can you identify any rows in the expression table with non-zero values, but equal in every column?" For Species_cstmm_counts.tsv, there are 47 rows with stdev.p = 0, and all of these rows have zero expression throughout. All remaining target_id rows have a non-zero stdev.p (so their expression is not equal across all SRA samples), although some of the calculated values are very small (e.g. the smallest non-zero stdev.p is 5.3E-10).
"What's the input for this set?" -- What other information or files can I provide?
Thank you for the help ~
For the Species_cstmm_counts.tsv, there are 47 rows with stdev.p = 0, and all these rows have all 0 expression. Then all the rest of the target_id have non-zero stdev.p (so they do not have equal expression across all SRA), however some calculated values are very small (e.g. 5.3E-10 is smallest non-zero stdev.p).
Is this also true for species.uncorrected.tc.tsv?
The amalgkit command you used may be helpful too.
For species.uncorrected.tc.tsv, it is the same pattern: 47 rows with a stdev.p of 0 that also have zero expression throughout, and all other rows with non-zero values. The smallest value here, though, is on the order of E-15, whereas it was E-10 for Species_cstmm_counts.tsv.
The amalgkit command used was:
sci_name="Apis_mellifera"
file_metadata='/home/osboxes/Documents/20210618_amalgkit/amalgkit_out/metadata/metadata/metadata_03_curated_20210623_no_brain_M.tsv'
dir_out='/home/osboxes/Documents/20210618_amalgkit/amalgkit_out/no_brain_M'
dir_count='/home/osboxes/Documents/20210618_amalgkit/amalgkit_out/merge/'
python3.9 amalgkit merge --out_dir ${dir_out} --metadata ${file_metadata}
python3.9 amalgkit cstmm --out_dir ${dir_out} --count ${dir_count} --ortho ./hoge
python3.9 amalgkit curate \
--out_dir ${dir_out} \
--infile ${dir_out}/cstmm/${sci_name}/${sci_name}_cstmm_counts.tsv \
--eff_len_file ${dir_out}/cstmm/${sci_name}/${sci_name}_eff_length.tsv \
--metadata ${file_metadata} \
--norm fpkm
Does this warning occur in just one iteration, or in multiple?
In rows where cor() has issues with an SD of 0, it will produce NAs, which are visible in the heatmap. You'll find a number of PDF files in out_dir/curate/plots/ named in the pattern species.iteration-number.correlation_cutoff.sva.pdf.
Can you send me the correlation_cutoff.sva.pdf of the steps where this message occurred?
Only one iteration.
Sent you an email with the PDF.
@C20H25N30, I found out what causes this.
Short version: This can safely be ignored and is not connected to issue https://github.com/kfuku52/amalgkit/issues/85
Long version: This warning only appears during round 0. In this round, samples with a low mapping rate have not yet been removed. In the dataset you sent me, there are 3 samples with a mapping rate of 0. All expression values for those 3 samples are therefore 0, which gives each of them a standard deviation of 0. Pearson correlation needs an SD > 0 because it divides by the SD along the way; if the SD is 0, the result is NA.
This doesn't cause any issues down the line, because these samples will be removed between round 0 and round 1 after mapping rate filtering.
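The effect described above can be reproduced outside of R with a small sketch. This is a hand-written Pearson correlation in Python, not the code amalgkit runs (amalgkit calls R's cor()); the sample names and values are invented, and NaN stands in for R's NA.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns NaN when either input has zero SD."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        # The denominator sqrt(ss_x * ss_y) would be 0, so r is undefined
        return math.nan
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical per-sample expression vectors (one value per gene)
samples = {
    "sample_ok_1": [1.0, 4.0, 2.0, 8.0],
    "sample_ok_2": [2.0, 8.0, 4.0, 16.0],
    "sample_unmapped": [0.0, 0.0, 0.0, 0.0],  # mapping rate of 0
}
names = list(samples)
corr = {(a, b): pearson(samples[a], samples[b]) for a in names for b in names}

for a in names:
    print(a, [round(corr[(a, b)], 3) for b in names])
```

Every entry in the row and column of the all-zero sample is undefined, including its correlation with itself, which matches the NA stripes visible in the round 0 heatmap.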
Did you make a change to suppress this warning in round 0?
No, but it should be easy to exclude samples with a mapping rate of 0 before going into round 0. I'll add a quick update.
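A pre-filter of this kind might look roughly like the sketch below. It is not the change that went into amalgkit (whose curation step is written in R); the function name and the "run" and "mapping_rate" keys are hypothetical placeholders for whatever the metadata table actually uses.

```python
def drop_unmapped_samples(metadata, expression, rate_key="mapping_rate"):
    """Remove samples whose mapping rate is 0 from metadata and expression.

    metadata:   list of dicts, one per sample, with a "run" ID (hypothetical key)
    expression: dict mapping gene ID -> {run ID: expression value}
    """
    keep = {m["run"] for m in metadata if float(m[rate_key]) > 0}
    kept_metadata = [m for m in metadata if m["run"] in keep]
    kept_expression = {
        gene: {run: value for run, value in by_run.items() if run in keep}
        for gene, by_run in expression.items()
    }
    return kept_metadata, kept_expression


# Made-up example: one of three samples did not map at all
metadata = [
    {"run": "SRR_A", "mapping_rate": "85.2"},
    {"run": "SRR_B", "mapping_rate": "0"},
    {"run": "SRR_C", "mapping_rate": "67.9"},
]
expression = {
    "gene1": {"SRR_A": 3.1, "SRR_B": 0.0, "SRR_C": 2.4},
    "gene2": {"SRR_A": 0.7, "SRR_B": 0.0, "SRR_C": 1.9},
}
kept_metadata, kept_expression = drop_unmapped_samples(metadata, expression)
print([m["run"] for m in kept_metadata])
print(kept_expression)
```

Running the filter before the first correlation step means no all-zero sample column reaches cor(), so the zero-SD warning should not be triggered by unmapped samples.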
Sounds good, thanks!
Added a 0% mapping rate check before round 0 in amalgkit ver. 0.5.2: https://github.com/kfuku52/amalgkit/commit/5eeafaaf3a72e9b2c87647394d78e5c3ac323dea
cat() should take a string ending with '\n' but it looks OK otherwise.
yeah, just noticed that myself. fixed. https://github.com/kfuku52/amalgkit/commit/d870b99d19a27a0f441961978ac678ca7f359bb7
When running amalgkit, in the terminal output during the transcriptome_curation.r step, I see "cor(tc, method = dist_method) : the standard deviation is zero".
I'm not sure what kind of warning this is, since amalgkit seems to have run successfully, but I wanted to flag it. Thank you.