tkzeng / Pangolin

Pangolin is a deep-learning method for predicting splice site strengths.
GNU General Public License v3.0

Reference genome mismatch due to lowercase sequence #12

Open gabrielle-y opened 1 year ago

gabrielle-y commented 1 year ago

https://github.com/tkzeng/Pangolin/blob/5cf94b8db938c658391b4305cd7ce33297d44ff7/pangolin/pangolin.py#LL110C1-L111C1

I'm trying to run Pangolin with the UCSC hg38 genome, which contains some lowercase sequence. Because of the if statement at line 110, variants in those regions are skipped with the following warning: "[Line 64] WARNING, skipping variant: Mismatch between FASTA (ref base: g) and variant file (ref base: G)." I tried converting seq to uppercase with a built-in Python function, but this did not resolve the issue.

I would appreciate a change to the script so that it supports lowercase sequences. If I resolve this in the meantime, I will update this issue with the solution.

gabrielle-y commented 1 year ago

Found the issue: we had to re-run the pip install to regenerate the updated pangolin.py file. After that, appending .upper() to line 103 overcame the error. I have not tested the downstream implications.
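
For anyone hitting the same warning, here is a minimal sketch of the idea behind the workaround: compare the reference base from the FASTA against the variant file's REF allele without regard to case. This is not Pangolin's actual code. The function name `ref_matches` and its parameters are hypothetical and only illustrate the comparison.

```python
def ref_matches(fasta_seq: str, offset: int, vcf_ref: str) -> bool:
    """Return True if the FASTA bases starting at `offset` equal the
    variant's REF allele, ignoring case.

    Hypothetical helper for illustration only; not part of Pangolin.
    """
    # Slice out as many bases as the REF allele is long.
    fasta_ref = fasta_seq[offset:offset + len(vcf_ref)]
    # Uppercase both sides so lowercase FASTA bases still match.
    return fasta_ref.upper() == vcf_ref.upper()


# Toy sequence with lowercase 'g' and 't', standing in for lowercase
# regions of a reference genome.
seq = "ACgtTA"

# A case-sensitive comparison fails on the lowercase base...
print(seq[2:3] == "G")

# ...while the case-insensitive check accepts it.
print(ref_matches(seq, 2, "G"))
```

Uppercasing only at the point of comparison leaves the original sequence untouched. Uppercasing the whole sequence once right after it is read, as in the line 103 change, has the same effect on the check but also changes what any later code sees.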