hepcat72 / vcfSampleCompare

Filter and rank variant call files (VCF) based on comparative evidence ratios between groups of samples.
GNU General Public License v3.0
2 stars 1 forks source link

Help with vcf format #25

Closed IsabelFE closed 3 years ago

IsabelFE commented 3 years ago

Hi,

I want to use this tool to compare groups of samples in order to find variants/SNPs more abundant in a group A vs. another group B. I tried running:

vcfSampleCompare.pl --sample-group '2021-0412-00I13_FB 2021-0412-00N21_FB 2021-0412-01B10_FB 2021-0412-01M22_FB 2021-0412-01P18_FB' --sample-group '2021-0412-02A21_FB 2021-0412-02G24_FB 2021-0413-00F17_FB 2021-0415-00F07_FB 2021-0415-01A05_FB' FB_merged_renamed.vcf.gz

And I got this error:

WARNING1: Carriage returns were found in your file and replaced with hard returns.
WARNING2: Column header line not found before data.  Using default expected FORMAT column number [9] and sample column start number [10].
ERROR1: Sample data was not found on line: [1] of VCF file [FB_merged_renamed.vcf.gz].  Skipping line.
        Supply --extended for additional details.
WARNING3: Column header line not found before data.  Using default expected FORMAT column number [9] and sample column start number [10].
ERROR2: Sample data was not found on line: [2] of VCF file [FB_merged_renamed.vcf.gz].  Skipping line.
        Supply --extended for additional details.
WARNING4: Column header line not found before data.  Using default expected FORMAT column number [9] and sample column start number [10].
WARNING4: NOTE: Further warnings of this type will be suppressed.
WARNING4: Set --error-limit to 0 to turn off error suppression
ERROR3: Sample data was not found on line: [3] of VCF file [FB_merged_renamed.vcf.gz].  Skipping line.
        Supply --extended for additional details.
ERROR3: NOTE: Further errors of this type will be suppressed.
ERROR3: Set --error-limit to 0 to turn off error suppression
WARNING47: FORMAT column header not found on column header line.  Using default expected FORMAT column number [9] and sample column start number [10].
ERROR46: A column header line with more than 1 sample name after the FORMAT column is required.  Skipping file [FB_merged_renamed.vcf.gz].
         Supply --extended for additional details.
#CHROM  POS ID  REF ALT BEST_PAIR   BEST_GT_SCORE   BEST_OR_SCORE   BEST_DP_SCORE   PAIR_ID PAIR_GT_SCORE   PAIR_OR_SCORE   PAIR_DP_SCORE   STATES_USED_GT  STATES_USED_OR  GROUP1_SAMPLES  GROUP1_GTS  GROUP1_ORS  GROUP2_SAMPLES  GROUP2_GTS  GROUP2_ORS

Done.  STATUS: [EXIT-CODE: 0 ERRORS: 46 WARNINGS: 47 TIME: 0s] SUMMARY:
    45 ERRORS LIKE: [ERROR1: Sample data was not found on line: [1] of VCF file [FB_merged_renamed.vcf.gz].  Skipping lin...]
    1 ERROR LIKE: [ERROR46: A column header line with more than 1 sample name after the FORMAT column is required.  Ski...]
    1 WARNING LIKE: [WARNING1: Carriage returns were found in your file and replaced with hard returns.]
    45 WARNINGS LIKE: [WARNING2: Column header line not found before data.  Using default expected FORMAT column number [9]...]
    1 WARNING LIKE: [WARNING47: FORMAT column header not found on column header line.  Using default expected FORMAT colu...]
    Scroll up to inspect full errors/warnings in-place.

I am not sure if my pipeline for generating the vcf file is correct. This is what I did:

  1. I used freebayes to create individual .vcf for all each of my .bam files (I tested this with 10 files):
    ls *.sorted.bam | awk -F"." '{print $1}' | sort -u | parallel 'freebayes -p 1 -C 10 -q 30 -f ~/wuhan_trim3polyA.fasta {}.sorted.bam -v {}_FB.vcf'
  2. I merged the 10 files using bcftools merge (I had to bgzip and index in order to do that):
    ls *_FB.vcf | awk -F"." '{print $1}' | sort -u | parallel 'bgzip {}.vcf'
    ls *_FB.vcf.gz | awk -F"." '{print $1}' | sort -u | parallel 'tabix {}.vcf.gz'
    ls *.vcf.gz > vcfout.list
    bcftools merge -m none --file-list vcfout.list -Oz -o FB_merged.vcf.gz --force-samples
  3. I renamed my samples on the header of the merged .vcf file. My bam files didn't had samples names, so I did this to avoid all my samples being names unknown1, unknown2...
    ls *_FB.vcf.gz | awk -F"." '{print $1}' > sample.list
    bcftools reheader -s sample.list FB_merged.vcf.gz > FB_merged_renamed.vcf.gz

    What am I missing on my final FB_merged_renamed.vcf.gz file?

Thanks

Isabel

hepcat72 commented 3 years ago

OK, this might be a simple fix, but then again, maybe this will only be the first issue to resolve.

vcfSampleCompare does not support gzip format (e.g. a file ending with .gz) - you can only supply is a .vcf format file (i.e. plain text). So try unzipping your output file with:

gzip -d FB_merged_renamed.vcf.gz

Then when you run it, if your file is very large and you're not sure it's it's doing anything, you can supply --verbose 2 to get running output on the processing (though note, ironically, this makes the script run overall, slower).

Other things you can try are:

  1. Run the tests: perl t/run_tests.t to see if your install is OK.
  2. Compare your file format the test input files in: https://github.com/hepcat72/vcfSampleCompare/tree/master/t/inputs
  3. If you can't figure it out, email me a copy of the input file and the full command (with options) you are trying to run and I can try and work it out.

Let me know how it goes.

Rob

IsabelFE commented 3 years ago

@hepcat72 thanks for your help! I unzipped my file and it worked 😊