griffithlab / regtools

Integrate DNA-seq and RNA-seq data to identify mutations that are associated with regulatory effects on gene expression.
https://regtools.readthedocs.org
MIT License

Comments/problems noticed while getting Example workflow to run #153

Open toddajohnson opened 3 years ago

toddajohnson commented 3 years ago

While getting the Example workflow to run on my data, I noticed three problems and propose solutions below (assuming I am not just doing something strange).

  1. The .tsv file output by cis-splice-effects identify contains gene_names and gene_ids columns (columns 15 and 16), but some other scripts in regtools appear to expect a single "genes" column. Running the variants.sh that @kcotto uploaded earlier to the scripts directory created a malformed BED file (based on the next step in the workflow, the BED file should have chrom\tstart\tend columns). The first command in variants.sh used {print $17} to extract the variant information, but variant_info is column 18 in the .tsv file. Changing $17 to $18 produced a reasonable BED file.
  2. Running vcf-concat samples/*/variants.vcf.gz | vcf-sort > all_variants_sorted.vcf failed with "The column names do not match" (vcf-concat expects the same sample columns in every input, so it cannot concatenate Sample1 onto Sample2). I switched to bcftools merge -Ou samples/*/variants.annotated.vcf.gz | bcftools sort -Oz -o all_variants_merged_sorted.vcf.gz -, which merges by variant and creates a sample data field for each sample.
  3. Related to item 1 above, running python3 stats_wrapper.py crashed because it calls compare_junctions_hist_v2.R, which reads file_name = paste("samples/", sample, "/output/cse_identify_filtered_compare_", tag,".tsv", sep = "") with fread and then selects cse_identify_data[,.(sample,variant_info,chrom,start,end,strand,anchor,score,name,genes)] from the resulting data.table. Because there was no genes column, the script halted. I added cse_identify_data[,genes:=gene_names] to the R script, after which the Example workflow finished.
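
Items 1 and 3 both come down to scripts assuming a fixed column layout. As a minimal sketch only (this is not regtools code), the Python below shows one way to make such a step less fragile: look up variant_info by header name instead of by position, and fall back to gene_names when no genes column exists. The function names, the reduced set of columns, and the sample values are hypothetical and chosen purely for illustration; it also assumes the .tsv has a header line.

```python
import csv
import io


def read_cse_rows(handle):
    """Yield rows of a tab-separated file as dicts keyed by header name.

    Assumption: the first line of the file is a header row.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        # Item 3: if there is no "genes" column, reuse gene_names for it.
        if "genes" not in row and "gene_names" in row:
            row["genes"] = row["gene_names"]
        yield row


def variant_info_values(handle):
    """Item 1: select variant_info by name, so a shifted column index
    (such as 17 versus 18) cannot silently pick the wrong field."""
    return [row["variant_info"] for row in read_cse_rows(handle)]


if __name__ == "__main__":
    # Hypothetical, heavily reduced example; a real file has more columns,
    # and the values here are made-up placeholders.
    sample = (
        "chrom\tgene_names\tgene_ids\tvariant_info\n"
        "chr1\tGENE_A\tID_A\tchr1:100-101\n"
    )
    print(variant_info_values(io.StringIO(sample)))
    print([row["genes"] for row in read_cse_rows(io.StringIO(sample))])
```

Selecting by name raises a KeyError when variant_info is missing, which is easier to notice than a malformed BED file further down the workflow.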

kcotto commented 3 years ago

Hi @toddajohnson, thank you for providing this valuable feedback and proposing these solutions. I had noticed items 1 and 3 during a recent run-through of my own and had made the exact changes you mentioned in a branch that I hadn't yet merged into master. We really do appreciate these solutions, as we want feedback and want to make RegTools as easy to use as possible.