problem with strandness detection logic

hxin commented 4 years ago

In some project, samples are from different batch(sequencing centre, different date, etc.) and might have different strand protocol. We currently implicitly detected the strandness of the sample by checking an arbitrary 'first' samples from the $SAMPLES variable, and use this strandness setting on all samples. This may potentially cause problem if samples from different batches have different strandness settings.

SAMPLES="S1 S2 S3 S4"
SAMPLE_STRAND_TEST=${SAMPLES%% *}

hxin commented 4 years ago

one solution is to check the picard output for the RnaSeqMetrics Strand Mapping numbers for each sample. If we have got the right strand setting, then the "Correct strand" values for every sample should be very high, e.g. > 95%.

This is shown in the multiqc report, and the numbers are stores in the here: _multiqc_data/multiqc_picardRnaSeqMetrics.txt
_PCT_R2_TRANSCRIPT_STRANDREADS, the 19th column in the above file

lweasel commented 3 years ago

Actually, I think this might only work for stranded libraries (which is most of them!). For unstranded, PCT_R2_TRANSCRIPT_STRAND_READS should be roughly 50%.

lweasel commented 3 years ago

Hmm - I think something may have changed in MultiQC too - I don't think that this code is checking the correct column in the MultiQC data file now.

hxin commented 3 years ago

I just checked the multiqc results and PCT_R2_TRANSCRIPT_STRAND_READS is still the 19th column as shown below

For the fltd project sample	PCT_R2_TRANSCRIPT_STRAND_READS
03_06STG.human	98.799
03_06VIS.human	98.7392
074_00STG.human	98.4321
074_00VIS.human	98.5764
09_18STG.human	98.6111
09_18VIS.human	98.6019
09_37STG.human	98.7456
09_37VIS.human	98.718
11_06STG.human	98.828
12_17_STG.human	98.9076
12_17_VIS.human	98.884
12_18STG.human	98.6602
12_18VIS.human	98.3619
12_30STG.human	98.9947
14_04STG.human	98.9559
14_04VIS.human	98.9685
14_16STG.human	98.8051
14_20STG.human	98.799
14_20VIS.human	98.9683
15_31STG.human	98.827
15_31VIS.human	98.8691
21_17_STG.human	98.8803
24_17_VIS.human	98.9669
25_17_STG.human	98.9736
25_17_VIS.human	98.7579
93_31STG.human	98.7289
93_31VIS.human	98.7494
A1051_14V.human	97.8101
A110_17S.human	97.5887
A1154_13S.human	95.6841
A210_17V.human	98.0122
A351_16S.human	97.6599

The current code is checking the correct column.

strandness_qc=`cat ${MAIN_DIR}/multiqc_data/multiqc_picard_RnaSeqMetrics.txt | tail -n +2 | awk -F '\t' '$19<99 {print $1"\t"$19}'`
if [ ! `echo -n "${strandness_qc}" | wc -l` = 0 ]; then
    echo "Error: The following sample may have the wrong strandness setting:"
    echo "${strandness_qc}"
    exit
fi

So do we want to lower this threshold value? Or do you think the test logic is wrong? @lweasel

lweasel commented 3 years ago

Hmm, ok I don't know what was going on there - perhaps I was looking at the wrong file!

But yes, it looks like we should lower the threshold - 95%? Also I still think we might need to change the logic for unstranded data (in which case, we'd want something like having the percentage between 45 and 55% ?)

hxin commented 3 years ago

test the logic with an unstranded data in /srv/data/ghardingham/activity_dependence_jqiu/human_esc_neurons and a stranded data in the fltd project to see how the % looks like.

hxin commented 3 years ago

I tested one rev-stranded data and one unstranded data with the current code. The 0,1 and 2 are the picard strand setting parameter which is actually irrelevant here...

So the logic will be: Quote Owen:

The column we are currently checking, PCT_R2_TRANSCRIPT_STRAND_READS, is just the percentage of reads that actually map to the second strand. It doesn't have anything to do with picard setting - i.e. in unstranded data the % of reads mapping to second strand is always going to be roughly 50%. For stranded data it's either going to be very close to 0 or very close to 100, depending on the type of stranded data (it's always been the latter case for data that we have).

So - if we have previously determined that the data is unstranded, this % should always be around 50% (unless some of the samples are actually stranded).

And if we have previously determined that the data is stranded (in the case of reverse stranded data anyway), this % should always be around 100% (unlesss some of the samples are actually unstranded, or stranded in the other way).

So if we have determined the data is unstranded, we can check this number is always between 45 and 55.

And if we have determined the data is stranded, we can check this number is always > 95.

hxin commented 3 years ago

So the implemented logic looks like follow:

awk -v s=$strandedness -F '\t' '{ if (s==0 && ($19<=45 || $19>=55))
                                      print $1"\t"$19;
                                   else if (s==1 && $19>=5)
                                      print $1"\t"$19;
                                   else if (s==2 && $19<=95)
                                      print $1"\t"$19};'`

If nothing print out, its good. If any of the _PCT_R2_TRANSCRIPT_STRANDREADS does not match the strandedness setting, the sample and its _PCT_R2_TRANSCRIPT_STRANDREADS value will be print out and cause an error to stop the script.

hxin commented 3 years ago

I have link the commit to a wrong issue. The relevant commit is d34477ff89aa72ac786fda527c93999a2e31eb8d.

hxin commented 3 years ago

It seems that the PCT_R2_TRANSCRIPT_STRAND_READS columns is not always the 19th column.

head  /home/kemelianova/projects/nrf2_astrocytes_lps_jing_testfix/multiqc_data/multiqc_picard_RnaSeqMetrics.txt | cut -f 29

Need to figure out why and a fix.

hxin commented 3 years ago

The two output seems to have the same number of columns but in different orders.

katieemelianova commented 3 years ago

It seems like selection by name rather than index would be best here! Not sure if I was running an older version of MultiQC which is the reason for the error I got :(

hxin commented 3 years ago

could you do a multiqc --version in your VE?

katieemelianova commented 3 years ago

Yep! multiqc, version 1.9 - hope that helps! 👌

hxin commented 3 years ago

So I think I know what might cause the problem. I use the following code to test the multiqc run:

cd /tmp/test_multiqc
multiqc --version
multiqc -dq -f -c /home/kemelianova/projects/nrf2_astrocytes_lps_jing_testfix/multiqc_config.yaml -m picard /home/kemelianova/projects/nrf2_astrocytes_lps_jing_testfix/results >/dev/null
echo "19th column in log file:" && head -1 multiqc_data/multiqc_picard_RnaSeqMetrics.txt | cut -f 19
echo "29th column in log file:" && head -1 multiqc_data/multiqc_picard_RnaSeqMetrics.txt | cut -f 29

For a multiqc installed with python2, i.e, if you run:

mkproject test
pip install multiqc

The PCT_R2_TRANSCRIPT_STRAND_READS is always at the 19th column.

For a multiqc installed with python3, i.e, if you run:

mkproject -p python3 test
pip install multiqc

The PCT_R2_TRANSCRIPT_STRAND_READS seems to be at a random index every time. Following is a few repeats run with python3 multiqc:

19th column in log file: PCT_R2_TRANSCRIPT_STRAND_READS 29th column in log file: INCORRECT_STRAND_READS

19th column in log file: MEDIAN_3PRIME_BIAS 29th column in log file: IGNORED_READS

19th column in log file: LIBRARY 29th column in log file: PCT_R1_TRANSCRIPT_STRAND_READS

19th column in log file: INCORRECT_STRAND_READS 29th column in log file: PCT_UTR_BASES

The bit of code that produces this file is at here which seems to use a OrderedDict to hold the table. Maybe this behave differently in py2/py3?

Are you running py3 multiqc which generates the output for the nrf2_astrocytes_lps_jing_testfix project? @katieemelianova

Anyway, I have created a fix which will pull the column by name, so hopefully, it will work regardless of the order. @lweasel

lweasel commented 3 years ago

Behaviour of dicts does seem to have changed between py2 and py3 (e.g. http://gandenberger.org/2018/03/10/ordered-dicts-vs-ordereddict/), so accessing by column name seems like a good idea.

sidbdri / cookiecutter-de_analysis_skeleton

problem with strandness detection logic #127