parklab / NGSCheckMate

Software program for checking sample matching for NGS data
MIT License

RNA-seq guidance in the manual #5

Closed cafletezbrant closed 6 years ago

cafletezbrant commented 6 years ago

I'm trying to use NGSCheckMate on the RNA-seq data (in paired-end FASTQ format) from this dataset [1], and I am getting surprising results that make me question whether I am doing the right thing. This study performed RNA-seq on 2 replicates for each of 17 subjects, and I am trying to verify that all replicates are correctly labeled. Because these are RNA-seq data, per the advice in section 3 of the guide [2], I have set the length of the genomic regions with read mapping to 1x10^8. The full invocation of NGSCheckMate is:

python "${NCM_HOME}"/ncm_fastq.py -l ~/"${DATA_HOME}"/data/checkmate_input.tab \
       -pt "${NCM_HOME}"/SNP/SNP.pt \
       -O ~/"${DATA_HOME}"/ngscheckmate_output -p 5 -R 1x10^8

The $XX_HOME variables refer to the data and NGSCheckMate directories. checkmate_input.tab is a tab-delimited file that specifies each RNA-seq run's paired-end files and which subject the run comes from. Surprisingly, NGSCheckMate reports that exactly 0 replicates from the same subject match. If I do not set the -R flag at all, I get 5 seemingly random matches, but none are between the 2 replicates from 1 subject. I find these results hard to believe, so I am asking for pointers about using this tool in the context of RNA-seq.

What is the right way to use NGSCheckmate with RNA-seq?

Additionally, is there a sample dataset used internally to verify that the program is running correctly?

[1] https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE50893
[2] https://github.com/parklab/NGSCheckMate

sejooning commented 6 years ago

Dear Cafletezbrant,

Thank you for using NGSCheckMate. Below I summarize the process for comparing the RNA-seq data in [1] using NGSCheckMate. To verify the performance of NGSCheckMate on RNA-seq, I downloaded 9 SRA RNA-seq files (3 of the 17 individuals) yesterday from the links below.

1) Downloading 9 SRA files

Input and replicates RNA-seq 10847

wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP030/SRP030041/SRR998295/SRR998170.sra
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP030/SRP030041/SRR998295/SRR998171.sra
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP030/SRP030041/SRR998295/SRR998172.sra

Input and replicates RNA-seq 18505

wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP030/SRP030041/SRR998295/SRR998293.sra
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP030/SRP030041/SRR998295/SRR998294.sra
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP030/SRP030041/SRR998295/SRR998295.sra

Input and replicates RNA-seq 18951

wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP030/SRP030041/SRR998327/SRR998327.sra
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP030/SRP030041/SRR998328/SRR998328.sra
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP030/SRP030041/SRR998329/SRR998329.sra

2) Converting SRA into FASTQ

fastq-dump --split-files SRR998170.sra
fastq-dump --split-files SRR998171.sra
fastq-dump --split-files SRR998172.sra
fastq-dump --split-files SRR998293.sra
fastq-dump --split-files SRR998294.sra
fastq-dump --split-files SRR998295.sra
fastq-dump --split-files SRR998327.sra
fastq-dump --split-files SRR998328.sra
fastq-dump --split-files SRR998329.sra

Generated fastq files

SRR998170_1.fastq SRR998170_2.fastq
SRR998171_1.fastq SRR998171_2.fastq
SRR998172_1.fastq SRR998172_2.fastq
SRR998293_1.fastq SRR998293_2.fastq
SRR998294_1.fastq SRR998294_2.fastq
SRR998295_1.fastq SRR998295_2.fastq
SRR998327_1.fastq SRR998327_2.fastq
SRR998328_1.fastq SRR998328_2.fastq
SRR998329_1.fastq SRR998329_2.fastq

3) Generating SRA_fastq_list.txt for ncm_fastq.py

/lus/scratch/user/sjlee/GIT/NGSCheckMate/SRA/download/SRR998170_1.fastq	/lus/scratch/user/sjlee/GIT/NGSCheckMate/SRA/download/SRR998170_2.fastq	SRR998170
/lus/scratch/user/sjlee/GIT/NGSCheckMate/SRA/download/SRR998171_1.fastq	/lus/scratch/user/sjlee/GIT/NGSCheckMate/SRA/download/SRR998171_2.fastq	SRR998171
/lus/scratch/user/sjlee/GIT/NGSCheckMate/SRA/download/SRR998172_1.fastq	/lus/scratch/user/sjlee/GIT/NGSCheckMate/SRA/download/SRR998172_2.fastq	SRR998172
/lus/scratch/user/sjlee/GIT/NGSCheckMate/SRA/download/SRR998293_1.fastq	/lus/scratch/user/sjlee/GIT/NGSCheckMate/SRA/download/SRR998293_2.fastq	SRR998293
/lus/scratch/user/sjlee/GIT/NGSCheckMate/SRA/download/SRR998294_1.fastq	/lus/scratch/user/sjlee/GIT/NGSCheckMate/SRA/download/SRR998294_2.fastq	SRR998294
/lus/scratch/user/sjlee/GIT/NGSCheckMate/SRA/download/SRR998295_1.fastq	/lus/scratch/user/sjlee/GIT/NGSCheckMate/SRA/download/SRR998295_2.fastq	SRR998295
/lus/scratch/user/sjlee/GIT/NGSCheckMate/SRA/download/SRR998327_1.fastq	/lus/scratch/user/sjlee/GIT/NGSCheckMate/SRA/download/SRR998327_2.fastq	SRR998327
/lus/scratch/user/sjlee/GIT/NGSCheckMate/SRA/download/SRR998328_1.fastq	/lus/scratch/user/sjlee/GIT/NGSCheckMate/SRA/download/SRR998328_2.fastq	SRR998328
/lus/scratch/user/sjlee/GIT/NGSCheckMate/SRA/download/SRR998329_1.fastq	/lus/scratch/user/sjlee/GIT/NGSCheckMate/SRA/download/SRR998329_2.fastq	SRR998329
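A list file like the one in step 3 can be generated rather than typed by hand. The following is a minimal sketch, not part of NGSCheckMate: the function name `build_fastq_list` and the example directory are hypothetical, and it assumes the `_1.fastq` / `_2.fastq` naming that `fastq-dump --split-files` produced above, with one tab-delimited row per sample (read 1, read 2, sample name).

```python
import re
from collections import defaultdict


def build_fastq_list(fastq_names, directory):
    """Return tab-delimited rows: read-1 path, read-2 path, sample name.

    Hypothetical helper: assumes files are named <sample>_1.fastq and
    <sample>_2.fastq (optionally gzipped), as in the listing above.
    """
    mates_by_sample = defaultdict(dict)
    for name in fastq_names:
        match = re.fullmatch(r"(.+)_([12])\.fastq(\.gz)?", name)
        if match:
            sample, mate = match.group(1), match.group(2)
            mates_by_sample[sample][mate] = f"{directory.rstrip('/')}/{name}"

    rows = []
    for sample in sorted(mates_by_sample):
        mates = mates_by_sample[sample]
        # Only emit samples for which both mates are present.
        if "1" in mates and "2" in mates:
            rows.append("\t".join([mates["1"], mates["2"], sample]))
    return rows


if __name__ == "__main__":
    names = [
        "SRR998170_1.fastq", "SRR998170_2.fastq",
        "SRR998171_1.fastq", "SRR998171_2.fastq",
    ]
    # "/path/to/download" is a placeholder directory, not a real location.
    for row in build_fastq_list(names, "/path/to/download"):
        print(row)
```

Writing the returned rows to a file, one per line, gives something that can be passed to the -l option in the same way as SRA_fastq_list.txt.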

4) Executing ncm_fastq.py using the -nz option

python ncm_fastq.py -p 4 -l SRA_fastq_list.txt -pt ./SNP/SNP.pt -O ./SRA_fastq_ncm/ -nz

5) Results in ./SRA_fastq_ncm/output_matched.txt

SRR998170 matched SRR998171 0.7441 2.64
SRR998170 matched SRR998172 0.7407 2.64
SRR998171 matched SRR998172 0.8924 44.37
SRR998293 matched SRR998294 0.804 2.58
SRR998293 matched SRR998295 0.8033 2.58
SRR998294 matched SRR998295 0.9189 50.42
SRR998327 matched SRR998328 0.7906 2.59
SRR998327 matched SRR998329 0.7798 2.59
SRR998328 matched SRR998329 0.9024 50.85

These results show that the samples matched perfectly within each individual:

[SRR998170,SRR998171,SRR998172] GM10847
[SRR998293,SRR998294,SRR998295] GM18505
[SRR998327,SRR998328,SRR998329] GM18951
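For a larger run, the comparison between reported matches and sample labels can be automated. The sketch below is an illustration only: the function names are hypothetical, and it assumes each line of output_matched.txt looks like the lines shown above, with the two sample names in the first and third whitespace-separated fields and the word "matched" in between. The remaining numeric columns are ignored.

```python
from itertools import combinations


def parse_matched(lines):
    """Collect the unordered sample pairs reported as matched."""
    pairs = set()
    for line in lines:
        fields = line.split()
        if len(fields) >= 3 and fields[1] == "matched":
            pairs.add(frozenset((fields[0], fields[2])))
    return pairs


def compare_to_labels(matched, groups):
    """Return (missing, unexpected) pairs relative to the labeled groups."""
    expected = set()
    for members in groups.values():
        for a, b in combinations(members, 2):
            expected.add(frozenset((a, b)))
    return expected - matched, matched - expected


if __name__ == "__main__":
    # Lines copied from the results above.
    reported = [
        "SRR998170 matched SRR998171 0.7441 2.64",
        "SRR998170 matched SRR998172 0.7407 2.64",
        "SRR998171 matched SRR998172 0.8924 44.37",
        "SRR998293 matched SRR998294 0.804 2.58",
        "SRR998293 matched SRR998295 0.8033 2.58",
        "SRR998294 matched SRR998295 0.9189 50.42",
        "SRR998327 matched SRR998328 0.7906 2.59",
        "SRR998327 matched SRR998329 0.7798 2.59",
        "SRR998328 matched SRR998329 0.9024 50.85",
    ]
    groups = {
        "GM10847": ["SRR998170", "SRR998171", "SRR998172"],
        "GM18505": ["SRR998293", "SRR998294", "SRR998295"],
        "GM18951": ["SRR998327", "SRR998328", "SRR998329"],
    }
    missing, unexpected = compare_to_labels(parse_matched(reported), groups)
    print("missing:", sorted(map(sorted, missing)))
    print("unexpected:", sorted(map(sorted, unexpected)))
```

For the nine lines above, both sets come back empty: every within-individual pair is reported and no cross-individual pair appears.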

Generally, I recommend using strict cut-off values with the -f option. For low-coverage samples, I also recommend the -nz option.

Thank you, Best regards, Sejoon Lee.

SooLee commented 6 years ago

Hi cafletezbrant

Thank you for pointing out the problem. Sorry for the late response. I did run the fastq module on the data set you mentioned as follows:

for SRR in $(cat $OUTDIR/SRR_list.txt)
do
    # Braces keep the shell from reading "SRR_1" / "SRR_2" as variable names.
    FASTQ1=$DATADIR/${SRR}_1.fastq
    FASTQ2=$DATADIR/${SRR}_2.fastq
    OUTFILE=$OUTDIR/$SRR.vafout
    bsub -q short -W 12:00 -n 4 "./ngscheckmate_fastq -1 $FASTQ1 -2 $FASTQ2 -R 1E8 -p4 $PTFILE > $OUTFILE"
done


* run `vaf_ncm.py` to get the matching results, with three different sets of options.

python ./vaf_ncm.py -I $OUTDIR -O $OUTDIR -N output.default
python ./vaf_ncm.py -I $OUTDIR -O $OUTDIR -N output.nz -nz
python ./vaf_ncm.py -I $OUTDIR -O $OUTDIR -N output.nzf -nz -f



As you pointed out, the first command (default options) gives many false positives.
The second command (-nz) reduces false positives, but not completely.
I'm not sure whether this data set contains any family samples, but the -f option can be applied for more stringent matching. The results from the third command (-nz -f) were consistent with the replicate labels.

Alternatively, the vaf output files could be fed to a clustering tool (e.g. hierarchical clustering), for a more robust matching output.
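As a rough illustration of the clustering idea, here is a minimal sketch in pure Python. It is not NGSCheckMate code and makes several assumptions: the VAF values have already been loaded into one equal-length list per sample, aligned on the same SNPs with no missing values (reading the vaf output files is not shown, since their layout is not described in this thread); distance is taken as 1 minus the Pearson correlation; clusters are merged by average linkage; and the 0.5 cut-off and the toy values are arbitrary, hypothetical choices.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant vector carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def cluster_samples(vafs, max_distance=0.5):
    """Average-linkage agglomerative clustering on 1 - Pearson correlation.

    vafs: dict mapping sample name -> list of VAF values (hypothetical input).
    Merging stops once the closest pair of clusters exceeds max_distance.
    """
    clusters = [[name] for name in sorted(vafs)]

    def distance(c1, c2):
        ds = [1 - pearson(vafs[a], vafs[b]) for a in c1 for b in c2]
        return sum(ds) / len(ds)

    while len(clusters) > 1:
        d, i, j = min(
            (distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > max_distance:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


if __name__ == "__main__":
    # Toy VAF vectors, invented for illustration only.
    toy = {
        "A_rep1": [0.0, 0.5, 1.0, 0.0, 0.5, 1.0],
        "A_rep2": [0.05, 0.45, 0.95, 0.0, 0.55, 1.0],
        "B_rep1": [1.0, 0.0, 0.5, 1.0, 0.0, 0.5],
        "B_rep2": [0.95, 0.05, 0.5, 1.0, 0.0, 0.45],
    }
    print(cluster_samples(toy))
```

Unlike pairwise thresholding, clustering assigns each sample to a single group, so one spurious pairwise match does not by itself link two individuals. For real data, an established implementation such as SciPy's hierarchical clustering would likely be a better choice than this hand-rolled loop.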

Best,
Soo