tkzeng / Pangolin

Pangolin is a deep-learning method for predicting splice site strengths.
GNU General Public License v3.0

Suggestion for alternative output/VEP plugin? #16

Open jessicakan789 opened 6 months ago

jessicakan789 commented 6 months ago

Great tool!

Just a suggestion for the future: it would help to give the user an option to write the Pangolin results to separate columns rather than only to the INFO column of the VCF file, which is difficult to read.

Alternatively, an even better approach would be to write a VEP plugin.

Thank you! :)

jessicakan789 commented 6 months ago

Example in R:

# import libraries
library(dplyr)
library(stringr)
library(vcfR)
library(tidyr)

# parse command line arguments: args[1] = input VCF, args[2] = output CSV
args <- commandArgs(trailingOnly = TRUE)

# import data
vcf <- read.vcfR(args[1], verbose = FALSE)

# read vcf as tidyverse dataframes
vcf_df <- vcfR2tidy(vcf)
vcf_df_fix <- vcf_df$fix

# split INFO column into multiple columns
info <- extract_info_tidy(vcf)

# isolate Pangolin column
pangolin_df <- info['Pangolin']

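# [How to Split Column Into Multiple Columns in R DataFrame? - GeeksforGeeks](https://www.geeksforgeeks.org/how-to-split-column-into-multiple-columns-in-r-dataframe/)
# split the Pangolin field on '|' or ':' into 6 fixed columns (no trailing space in the pattern, otherwise nothing matches)
edit_pangolin_df <- str_split_fixed(pangolin_df$Pangolin, '[\\|:]', 6)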

# change column names
colnames(edit_pangolin_df) <- c('Pangolin gene', 'Pangolin pos_1', 'Pangolin score_change_1', 'Pangolin pos_2', 'Pangolin score_change_2', 'Pangolin warnings')

# merge pangolin columns with original data
merged_df <- cbind(vcf_df_fix, edit_pangolin_df)

# write out to csv
write.csv(merged_df, args[2], row.names=FALSE)
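
Since Pangolin itself is written in Python, the same split could perhaps be offered as an output option there. Below is a minimal sketch of the idea, not Pangolin's actual code: `split_pangolin` and `PANGOLIN_COLUMNS` are hypothetical names, the column labels are copied from the R script above, and the field layout (six pieces delimited by `|` or `:`) is an assumption inferred from that script. The sample value is made up for illustration.

```python
import re

# Column labels copied from the R script above (hypothetical constant name).
PANGOLIN_COLUMNS = [
    "Pangolin gene",
    "Pangolin pos_1",
    "Pangolin score_change_1",
    "Pangolin pos_2",
    "Pangolin score_change_2",
    "Pangolin warnings",
]


def split_pangolin(value):
    """Split one Pangolin INFO value into named columns.

    Assumes the value is delimited by '|' or ':' in the same order as
    PANGOLIN_COLUMNS, mirroring str_split_fixed(..., 6) in the R script.
    """
    # At most 5 splits, so the 6th piece keeps whatever text remains.
    parts = re.split(r"[|:]", value, maxsplit=len(PANGOLIN_COLUMNS) - 1)
    # Pad short or empty values with empty strings, as str_split_fixed does.
    parts += [""] * (len(PANGOLIN_COLUMNS) - len(parts))
    return dict(zip(PANGOLIN_COLUMNS, parts))


# Made-up value for illustration only, not real Pangolin output.
row = split_pangolin("GENE1|33:0.5|-50:-0.2|Warnings:")
print(row["Pangolin gene"], row["Pangolin pos_1"], row["Pangolin score_change_1"])
```

Each returned dictionary could then be written as one row with `csv.DictWriter`, alongside the fixed VCF fields.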