secastel / phaser

phasing and Allele Specific Expression from RNA-seq
GNU General Public License v3.0
107 stars 37 forks source link

cannot run phASER to my aligned data #57

Open x811zou opened 4 years ago

x811zou commented 4 years ago

Hi! I aligned HG00096 from 1000GP with STAR and Tophat separately. And then I ran phASER using the same parameter you gave in the tutorial.

The error message for running phASER with the BAM aligned with STAR is:

#2. Retrieving reads that overlap heterozygous sites...
     file: /data/reddylab/scarlett/1000G/data/StarOutput/HG00096/Aligned.sortedByCoord.out.bam
          minimum mapq: 255
          mapping reads to variants...
[bam_parse_region] fail to determine the sequence name.
[main_samview] region "1:" specifies an unknown reference name. Continue anyway.
[samopen] SAM header is present: 25 sequences.
[sam_read1] reference 'user command line: STAR --runMode alignReads --genomeDir /data/reddylab/scarlett/1000G/data/STARIndex --runThreadN 24 --readFilesCommand zcat --readFilesIn /gpfs/fs1/data/reddylab/scarlett/1000G/data/fastq/HG00096/ERR188040_R1.fastq.gz /gpfs/fs1/data/reddylab/scarlett/1000G/data/fastq/HG00096/ERR188040_R2.fastq.gz --outSAMtype BAM Unsorted SortedByCoordinate --outReadsUnmapped Fastx --outFileNamePrefix /data/reddylab/scarlett/1000G/data/StarOutput/HG00096/
AMtype BAM   Unsorted   SortedByCoordinate
' is recognized as '*'.
[main_samview] truncated file.
Exception in thread Thread-6:
Traceback (most recent call last):
  File "/nfs/software/helmod/apps/Core/Anaconda/2.5.0-fasrc01/x/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/nfs/software/helmod/apps/Core/Anaconda/2.5.0-fasrc01/x/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/nfs/software/helmod/apps/Core/Anaconda/2.5.0-fasrc01/x/lib/python2.7/multiprocessing/pool.py", line 389, in _handle_results
    task = get()
TypeError: ('__init__() takes at least 3 arguments (1 given)', <class 'subprocess.CalledProcessError'>, ())

The problem for using phASER with the BAM aligned with Tophat is that it stuck at first step forever ...

STARTED "Read backed phasing and ASE/haplotype analyses" ...
    DATE, TIME : 2019-11-01, 10:44:04
#1. Loading heterozygous variants into intervals...
Processing sample named HG00096
    using all the chromosomes ...
    processing VCF...

Does phASER work with Tophat aligned bam file? And what parameters do you specify for STAR alignment? I wonder whether that is the point where causing these errors. Could you help me with this? I'd appreciate your help!

Thanks, Scarlett

secastel commented 4 years ago

Hi Scarlett, I answered you in the other thread, but just to clarify for other people who may see this. It looks like you are using VCFs and BAMs with discordant chromosome naming. Please make sure that both the VCF and BAM have the same naming (e.g. "1" vs "chr1"). Once you have consistent naming between the two, make sure to use the appropriate annotation files with phASER (with or without "chr"), which can be downloaded from here: https://stephanecastel.wordpress.com/2017/02/15/how-to-generate-ase-data-with-phaser/

x811zou commented 4 years ago

@secastel Thank you! I just figured it out! But my result gene_ae.txt columns (totalCount log2_aFC n_variants variants gw_phased) are all zeros. I am wondering whether it is because my input data is unphased, which only have 0/0,1/1,0/1,1/0 in VCF. Would it explain the "0" variants below? Does phASER work with unphased data? Thanks!

#7. Outputting phased VCF...
     GT field is not being updated with phASER genome wide phase. This can be changed using the --gw_phase_vcf argument.
     Compressing and tabix indexing output VCF...

     COMPLETED using 1264249 reads in 492 seconds using 4 threads
     PHASED  10111 of 30870 all variants (= 0.327535) with at least one other variant
     GENOME WIDE PHASED  0 of 34022 unphased variants (= 0.000000)
     GENOME WIDE PHASE CORRECTED  0 of 30870 variants (= 0.000000)
     Global maximum memory usage: 1262.21 (mb)
secastel commented 4 years ago

phASER should work with an unphased VCF, although it will work much, much better if you start with a phased VCF. If you are working with human samples, I would strongly recommend phasing your VCF using a tool like the Sanger Imputation Server (https://imputation.sanger.ac.uk).

As for why you are seeing all zeros, I expect it may have something to do with a contig naming problem. When running phaser_gene_ae you need to make sure to use the right features file to match your chromosome naming. For example (assuming hg19), if your chromosomes are named “chr1” then you need to use this file: https://www.dropbox.com/s/am09zwpjhs01k8u/gencode.v19.GRCh37.genes.chr.bed.gz?dl=0 otherwise if they are named “1” you need to use this file: https://www.dropbox.com/s/1u9zo1kx61zx6ca/gencode.v19.GRCh37.genes.bed.gz?dl=0

Please let me know if this helps.

x811zou commented 4 years ago

@secastel Thank you! I just have another question regarding to the "unphased" genotype from the output phased VCF file from phASER, I do not see the phased genotype for each variant. I am attaching the first line from the input VCF and output VCF below:

##### Input VCF
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  20
chr1    14542   .   A   G   22.44   PASS    AC=2;AF=1.00;AN=2;DP=1;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;QD=22.44;SOR=1.609    GT:AD:DP:GQ:PL  1/1:0,1:1:3:32,3,0

###### Output VCF from phASER
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  20
chr1    14542   .   A   G   22.44   PASS    AC=2;AF=1.00;AN=2;DP=1;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;QD=22.44;SOR=1.609    GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC:PM    1/1:0,1:1:3:32,3,0:1/1:.:.:1/1:.:.

I just checked haplotypic_counts.txt, and I found that instead of outputting the haplotype for each variant, the output files have listed the alleles for each (A/B) haplotype at each chromosome-position range. Please let me know if I miss anything! Thank you!

secastel commented 4 years ago

The phase will be updated in the "GT" field of the outputted VCF if you have included the argument "--gw_phase_vcf 1". In the case of the first line that you've shown, first, that variant is not covered by any reads, thus phaser will not phase it, and second, it is a homozygous variant, so there is no phase to speak of. phASER runs best when your input VCF has been phased using population phasing. If you are working with human data, I would suggest using a tool like the Sanger Imputation Server.