mozack / vdjer

V'DJer - B Cell Receptor Repertoire Reconstruction from short read mRNA-Seq data
Other
28 stars 5 forks source link

Absent IGH contigs #17

Closed andrewjmc closed 5 years ago

andrewjmc commented 5 years ago

Now that I have a better understanding of the tool, I have tested VDJer on 200 samples. It runs as I expect for IGK and IGL, producing contigs in some, but not all samples, and more in cases than controls (where one may expect more circulating B cells). Usually, a sample gives both light chain contigs or none at all. However, I never get any IGH contigs, and I am puzzled.

I have confirmed that the IGH region is no worse covered than the IGK/IGL regions.

Do you have any suggestions? I will gladly share log files or other materials as needed.

Many thanks for your advice,

Andrew

mozack commented 5 years ago

Strange, I've not run into this before. Can you share the command you've used to STAR and V'DJer and also a log file for a sample that has good light chain data with missing heavy chain?

andrewjmc commented 5 years ago

Sure. I am almost certain there'll be something stupid I've done!

I have a short bash script which does the necessary

#!/bin/bash
set -x

#Get list of fastq
files=`find *.fastq`

#Get unique stems
ident=`echo "$files" | cut -d_ -f1-3 | uniq`

Iterate through file pairs
while read f; do
  echo "$f"
  #Run STAR
  STAR \
        --runThreadN 12 \
        --genomeDir ~/data/transcriptomics/star-ref \
        --readFilesIn ${f}_1.fastq ${f}_2.fastq \
        --outSAMunmapped Within \
        --outSAMtype BAM SortedByCoordinate \
        --outFileNamePrefix ./${f}_
  #Index SAM
  samtools index ${f}_Aligned.sortedByCoord.out.bam
  #Run VDJer IGH
  ~/data/vdjer/vdjer --in ${f}_Aligned.sortedByCoord.out.bam --t 12 --ins 1000 --mq 60 --mf 2 --rs 25 --ms 2 --mcs -5.5 --chain IGH --ref-dir ~/data/vdjer/human_ref/igh > ${f}_igh.vdjer.sam 2> ${f}_igh.vdjer.log
  mv vdjer.dot ${f}_igh_vdjer.dot
  mv vdj_contigs.fa ${f}_vdj_igh_contigs.fa
  #Run VDJer IGK
  ~/data/vdjer/vdjer --in ${f}_Aligned.sortedByCoord.out.bam --t 12 --ins 1000 --mq 60 --mf 2 --rs 25 --ms 2 --mcs -5.5 --chain IGK --ref-dir ~/data/vdjer/human_ref/igk > ${f}_igk.vdjer.sam 2> ${f}_igk.vdjer.log
  mv vdjer.dot ${f}_igk_vdjer.dot
  mv vdj_contigs.fa ${f}_vdj_igk_contigs.fa
  #Run VDJer IGL
  ~/data/vdjer/vdjer --in ${f}_Aligned.sortedByCoord.out.bam --t 12 --ins 1000 --mq 60 --mf 2 --rs 25 --ms 2 --mcs -5.5 --chain IGL --ref-dir ~/data/vdjer/human_ref/igl > ${f}_igl.vdjer.sam 2> ${f}_igl.vdjer.log
  mv vdjer.dot ${f}_igl_vdjer.dot
  mv vdj_contigs.fa ${f}_vdj_igl_contigs.fa
done < <(echo "$ident")

Here are two log files: WTCHG_373657_219132_igh.vdjer.log WTCHG_373657_219132_igk.vdjer.log

At the moment, I am working off a suggested insert size (I haven't aligned against the transcriptome to check this). If this could be the problem, then don't waste time looking into this. I will first validate the insert size.

Many thanks for your help,

Andrew

mozack commented 5 years ago

That is an unusually large insert size. I recommend re-computing from a transcriptome alignment.

andrewjmc commented 5 years ago

Thanks. I have done the above, and calculated a median insert size of around 200. I reran VDJer with this. Interestingly, the numbers of IGK and IGL contigs have reduced (e.g. 24 -> 2, 692 -> 34, 626 -> 30) and there are still no IGH contigs.

WTCHG_373657_219132_igk.vdjer.log WTCHG_373657_219132_igh.vdjer.log

Here are the same log files again in case it helps.

Thanks again,

Andrew

samalberman commented 5 years ago

Hi,

I am having exactly the same issue - for some samples, I am getting plenty of kappa and lambda light chain contigs but I never seem to get any heavy chain contigs.

Here is the script I am using:

#!/bin/bash

set -x

f=sample_1

#run STAR alignment
~/STAR/STAR \
        --runThreadN 12 \
        --genomeDir ~/data/ref_genome/STAR_index \
        --readFilesIn ~/data/RNASeq_Data/samples/${f}_1.fa ~/data/RNASeq_Data/samples/${f}_2.fa \
        --outSAMunmapped Within \
        --outSAMtype BAM SortedByCoordinate \
        --outFileNamePrefix ~/data/RNASeq_Data/bam/${f}_

#index SAM file
~/vdjer/samtools index ~/data/RNASeq_Data/bam/${f}_Aligned.sortedByCoord.out.bam

#run VDJer for IGH
~/vdjer/vdjer --in ~/data/RNASeq_Data/bam/${f}_Aligned.sortedByCoord.out.bam --t 12 --ins 200 --mq 60 --mf 2 --rs 25 --ms 2 --mcs -5.5 --chain IGH --ref-dir ~/data/RNASeq_Data/human_references/IGH > ${f}_IGH.vdjer.sam 2> ${f}_IGH.vdjer.log

#run VDJer for IGK
~/vdjer/vdjer --in ~/data/RNASeq_Data/bam/${f}_Aligned.sortedByCoord.out.bam --t 12 --ins 200 --mq 60 --mf 2 --rs 25 --ms 2 --mcs -5.5 --chain IGK --ref-dir ~/data/RNASeq_Data/human_references/IGK > ${f}_IGK.vdjer.sam 2> ${f}_IGK.vdjer.log

#run VDJer for IGL
~/vdjer/vdjer --in ~/data/RNASeq_Data/bam/${f}_Aligned.sortedByCoord.out.bam --t 12 --ins 200 --mq 60 --mf 2 --rs 25 --ms 2 --mcs -5.5 --chain IGL --ref-dir ~/data/RNASeq_Data/human_references/IGL > ${f}_IGL.vdjer.sam 2> ${f}_IGL.vdjer.log

As you can see, I am using an insert size of 200 (as specified by the sequencing facility), running in sensitive mode and including unmapped reads from the STAR alignment.

Here are the corresponding log files:

sample_1_IGH.vdjer.log

sample_1_IGK.vdjer.log

sample_1_IGL.vdjer.log

Did you make any headway with @andrewjmc's request? If so, any advice would be much appreciated!

mozack commented 5 years ago

Sorry, I don't have a quick answer for this and I have not noticed this behavior across a cohort in any of the datasets that I have processed.

It looks like your reads are in fasta format - Is that correct? Lack of base quality scores may convolute the assembly.

You might try running Mixcr on your dataset to see if it detects heavy chains and also determine if the heavy chain counts have a relationship with the light chain counts.

andrewjmc commented 5 years ago

Just to get back and close the issue... you won't be surprised to know it was an internal issue.

I was not aware when I was given the files that individual samples had reads split across multiple files, as such we were running only a subset.

It still perplexes me why this only affected the heavy chains (as we could see heavy chain reads in the samples, and I could not conceive a reason for different loci to be systematically segregated differently among files), unless perhaps the depth threshold for reconstructing heavy chains is higher (due to the greater size/complexity).

Thanks for your suggestions and sorry for troubling you with this unnecessarily.