saigegit / SAIGE

Development for SAIGE and SAIGE-GENE(+)
GNU General Public License v3.0
64 stars 27 forks source link

How to create groupfile? #80

Open Ankeet3 opened 1 year ago

Ankeet3 commented 1 year ago

I am trying to run SAIGE-GWAS Gene and Region Based Association Tests, and I am trying to figure out how to obtain legitimate groupfiles to use for the analysis. I know there is an R-script available, linked in the saige-git (https://github.com/saigegit/UKBB-WES/blob/main/make-group-file_use_annovar_output.R). But there is no supporting documentation for the script, how it should be used, what it's input parameters and output(s) are. Any help on this would be gladly appreciated?

uki-uiu commented 1 year ago

Hello! Did you manage to find out more about how to generate the groupFiles? I am looking for the same information, so if you did could you please share?

evatosco commented 1 year ago

Hi!

I am trying to run SAIGE-GENE too. I have encountered an error meanwhile but that's another story. Regarding the groupfile creation, this is my approach considering the files that I work with, which will probably differ from the ones you have. I hope this helps somehow, even partially. So I have a .grp file (groupfile) from running the same VCF input with epacts. It has this structure (tab-separated):

GENE1   1:150_A/G   1:152_C/G   1:155_C/T   1:159_A/T
GENE2   2:150_A/G   2:152_C/G   2:155_C/T   2:159_A/T
GENE3   3:150_A/G   3:152_C/G   3:155_C/T   3:159_A/T
GENE4   4:150_A/G   4:152_C/G   4:155_C/T   4:159_A/T

I tried to create a "mock" groupfile here, since the input VCF file that I have is already pre-filtered according to my consequences of interest. I annotated those variants with VEP and filtered them using filter_vep (in VEP), if that's of interest. Here's my manual approach in bash (I'm trying to create a "pass" filter since my input VCF is already filtered, as I stated above):

## replace _ and / for : ; also the first tab for a # just temporarily to keep working on it, and adding a "var" between fields
sed 's/_/:/g' $grp | sed 's|/|:|g' | sed 's/\t/#/' | awk -F'#' '{OFS=" "; print $1,"var",$2}' > ${grp}.uncompleted1
# then a loop where:
while read line
do
    # Nvars is the number of variants in each gene
    Nvars=$(echo $line | sed -e 's/ /\t/g' | cut -f 3- | sed -e 's/\s/\n/g' | wc -l)
    # print original line
    echo -e "$line" 
    # print modified line next with the same gene but then adding "anno" and "pass" as many times as variants you have in the previous line
    echo -n $line | awk -F' ' '{OFS=" "; printf "%s anno", $1}'
    for ((i=1; i<=Nvars; i++)); do echo -ne " pass" ; done
    echo ""
done < ${grp}.uncompleted1 > ${grp}.completed.grp
# replace all non-spaces by spaces just in case:
sed -i 's/\t/ /g' ${grp}.completed.grp

Obviously this is very custom to my files. If you're thinking about creating a .grp file from a VCF file via epacts annotate, please keep in mind that epacts uses a static reference -- a static version, and it might be updated, or not. If it has an old reference, it can miss genes that VEP recognizes correctly, or it might show an old name of a gene. I guess you could create a grp file from scratch (not using epacts) by usingbcftools +split-vep using the -f (format) flag, maybe indicating something like '%SYMBOL\t%CHROM:%POS:%REF%:ALT\n', then try to merge the variants of the same gene from several lines to the same line (and then adapting and using the code above). I would try using awk and ask for help if you have a colleague nearby. I would try myself here, but right now I'm in a bit of a hurry :)

I hope it gives you some perspective and it's not as confusing as it sounds! Best of luck to you all!

Edit: I have found out lately that the format of the variant names in the GroupFile that you create has to match perfectly the IDs of the variants in the plink files, in the same field where rsID are by default.

I used this code right here (in my experience, only plink2 worked for this, not v.1.9):

  plink2 \
  --bfile ${input} \
  --make-bed --allow-no-sex \
  --set-all-var-ids @:#:\$r:\$a \
  --out ${output}

to update the variant IDs of the plink files to perfectly match the format of the variants as detailed before in this comment. They may be however you want, but I personally ended up using semicolons like this: 1:150:A:G.

Hope it helps!