Closed hxin closed 3 years ago
one solution is to check the picard output for the RnaSeqMetrics Strand Mapping numbers for each sample. If we have got the right strand setting, then the "Correct strand" values for every sample should be very high, e.g. > 95%.
This is shown in the multiqc report, and the numbers are stores in the here:
_multiqc_data/multiqc_picardRnaSeqMetrics.txt
_PCT_R2_TRANSCRIPT_STRANDREADS, the 19th column in the above file
Actually, I think this might only work for stranded libraries (which is most of them!). For unstranded, PCT_R2_TRANSCRIPT_STRAND_READS should be roughly 50%.
Hmm - I think something may have changed in MultiQC too - I don't think that this code is checking the correct column in the MultiQC data file now.
I just checked the multiqc results and PCT_R2_TRANSCRIPT_STRAND_READS is still the 19th column as shown below
For the fltd project sample | PCT_R2_TRANSCRIPT_STRAND_READS |
---|---|
03_06STG.human | 98.799 |
03_06VIS.human | 98.7392 |
074_00STG.human | 98.4321 |
074_00VIS.human | 98.5764 |
09_18STG.human | 98.6111 |
09_18VIS.human | 98.6019 |
09_37STG.human | 98.7456 |
09_37VIS.human | 98.718 |
11_06STG.human | 98.828 |
12_17_STG.human | 98.9076 |
12_17_VIS.human | 98.884 |
12_18STG.human | 98.6602 |
12_18VIS.human | 98.3619 |
12_30STG.human | 98.9947 |
14_04STG.human | 98.9559 |
14_04VIS.human | 98.9685 |
14_16STG.human | 98.8051 |
14_20STG.human | 98.799 |
14_20VIS.human | 98.9683 |
15_31STG.human | 98.827 |
15_31VIS.human | 98.8691 |
21_17_STG.human | 98.8803 |
24_17_VIS.human | 98.9669 |
25_17_STG.human | 98.9736 |
25_17_VIS.human | 98.7579 |
93_31STG.human | 98.7289 |
93_31VIS.human | 98.7494 |
A1051_14V.human | 97.8101 |
A110_17S.human | 97.5887 |
A1154_13S.human | 95.6841 |
A210_17V.human | 98.0122 |
A351_16S.human | 97.6599 |
The current code is checking the correct column.
strandness_qc=`cat ${MAIN_DIR}/multiqc_data/multiqc_picard_RnaSeqMetrics.txt | tail -n +2 | awk -F '\t' '$19<99 {print $1"\t"$19}'`
if [ ! `echo -n "${strandness_qc}" | wc -l` = 0 ]; then
echo "Error: The following sample may have the wrong strandness setting:"
echo "${strandness_qc}"
exit
fi
So do we want to lower this threshold value? Or do you think the test logic is wrong? @lweasel
Hmm, ok I don't know what was going on there - perhaps I was looking at the wrong file!
But yes, it looks like we should lower the threshold - 95%? Also I still think we might need to change the logic for unstranded data (in which case, we'd want something like having the percentage between 45 and 55% ?)
I tested one rev-stranded data and one unstranded data with the current code. The 0,1 and 2 are the picard strand setting parameter which is actually irrelevant here...
So the logic will be: Quote Owen:
The column we are currently checking, PCT_R2_TRANSCRIPT_STRAND_READS, is just the percentage of reads that actually map to the second strand. It doesn't have anything to do with picard setting - i.e. in unstranded data the % of reads mapping to second strand is always going to be roughly 50%. For stranded data it's either going to be very close to 0 or very close to 100, depending on the type of stranded data (it's always been the latter case for data that we have).
So - if we have previously determined that the data is unstranded, this % should always be around 50% (unless some of the samples are actually stranded).
And if we have previously determined that the data is stranded (in the case of reverse stranded data anyway), this % should always be around 100% (unlesss some of the samples are actually unstranded, or stranded in the other way).
So if we have determined the data is unstranded, we can check this number is always between 45 and 55.
And if we have determined the data is stranded, we can check this number is always > 95.
So the implemented logic looks like follow:
awk -v s=$strandedness -F '\t' '{ if (s==0 && ($19<=45 || $19>=55))
print $1"\t"$19;
else if (s==1 && $19>=5)
print $1"\t"$19;
else if (s==2 && $19<=95)
print $1"\t"$19};'`
If nothing print out, its good. If any of the _PCT_R2_TRANSCRIPT_STRANDREADS does not match the strandedness setting, the sample and its _PCT_R2_TRANSCRIPT_STRANDREADS value will be print out and cause an error to stop the script.
I have link the commit to a wrong issue. The relevant commit is d34477ff89aa72ac786fda527c93999a2e31eb8d.
It seems that the PCT_R2_TRANSCRIPT_STRAND_READS columns is not always the 19th column.
head /home/kemelianova/projects/nrf2_astrocytes_lps_jing_testfix/multiqc_data/multiqc_picard_RnaSeqMetrics.txt | cut -f 29
Need to figure out why and a fix.
The two output seems to have the same number of columns but in different orders.
It seems like selection by name rather than index would be best here! Not sure if I was running an older version of MultiQC which is the reason for the error I got :(
could you do a
multiqc --version
in your VE?
Yep! multiqc, version 1.9 - hope that helps! 👌
So I think I know what might cause the problem. I use the following code to test the multiqc run:
cd /tmp/test_multiqc
multiqc --version
multiqc -dq -f -c /home/kemelianova/projects/nrf2_astrocytes_lps_jing_testfix/multiqc_config.yaml -m picard /home/kemelianova/projects/nrf2_astrocytes_lps_jing_testfix/results >/dev/null
echo "19th column in log file:" && head -1 multiqc_data/multiqc_picard_RnaSeqMetrics.txt | cut -f 19
echo "29th column in log file:" && head -1 multiqc_data/multiqc_picard_RnaSeqMetrics.txt | cut -f 29
For a multiqc installed with python2, i.e, if you run:
mkproject test
pip install multiqc
The PCT_R2_TRANSCRIPT_STRAND_READS is always at the 19th column.
For a multiqc installed with python3, i.e, if you run:
mkproject -p python3 test
pip install multiqc
The PCT_R2_TRANSCRIPT_STRAND_READS seems to be at a random index every time. Following is a few repeats run with python3 multiqc:
19th column in log file: PCT_R2_TRANSCRIPT_STRAND_READS 29th column in log file: INCORRECT_STRAND_READS
19th column in log file: MEDIAN_3PRIME_BIAS 29th column in log file: IGNORED_READS
19th column in log file: LIBRARY 29th column in log file: PCT_R1_TRANSCRIPT_STRAND_READS
19th column in log file: INCORRECT_STRAND_READS 29th column in log file: PCT_UTR_BASES
The bit of code that produces this file is at here which seems to use a OrderedDict to hold the table. Maybe this behave differently in py2/py3?
Are you running py3 multiqc which generates the output for the nrf2_astrocytes_lps_jing_testfix project? @katieemelianova
Anyway, I have created a fix which will pull the column by name, so hopefully, it will work regardless of the order. @lweasel
Behaviour of dicts does seem to have changed between py2 and py3 (e.g. http://gandenberger.org/2018/03/10/ordered-dicts-vs-ordereddict/), so accessing by column name seems like a good idea.
In some project, samples are from different batch(sequencing centre, different date, etc.) and might have different strand protocol. We currently implicitly detected the strandness of the sample by checking an arbitrary 'first' samples from the $SAMPLES variable, and use this strandness setting on all samples. This may potentially cause problem if samples from different batches have different strandness settings.