kr-colab / locator

deep learning prediction of geographic location from individual genome sequences

Question relating to the data processing steps (need help) #5

Closed FritzPeleke closed 4 years ago

FritzPeleke commented 4 years ago

Hi :), I have read the paper and it is really beautiful work. I have a school project to train a neural network that predicts the origin of herpes viruses, using only DNA sequences from NCBI. I understood the architecture of the neural network in your paper, but I did not understand how you processed your sequences into data you could feed to the network. My plan is to align the sequences with a tool such as MUSCLE, load the alignment into a NumPy array, encode the nucleotides as numeric values, and feed that to my neural network. I do not know whether this plan is sound. Could you please explain how you processed your data into a form the network can take as input? And in your expert opinion, does my method sound reasonable, or could you suggest a better approach? Thanks in advance.

andrewkern commented 4 years ago

Hi there, yes, your plan to encode nucleotides numerically may work well. I would also encourage you to check out one-hot encoding, as that is popular with folks working with nucleotide-alphabet inputs.
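
For readers landing on this thread, here is a minimal sketch of what one-hot encoding an alignment could look like with NumPy. It is not the preprocessing used by locator; the function name `one_hot_encode`, the `ACGT` alphabet, the toy sequences, and the choice to map gaps and ambiguous bases to an all-zero vector are all assumptions made for illustration.

```python
import numpy as np

# Assumed alphabet: the four unambiguous nucleotides.
# Anything else (gaps "-", ambiguous "N", ...) becomes an all-zero vector.
ALPHABET = "ACGT"


def one_hot_encode(aligned_seqs):
    """Encode equal-length aligned sequences as an array of shape
    (n_sequences, alignment_length, 4)."""
    lengths = {len(seq) for seq in aligned_seqs}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to the same length")

    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = np.zeros(
        (len(aligned_seqs), lengths.pop(), len(ALPHABET)), dtype=np.float32
    )
    for i, seq in enumerate(aligned_seqs):
        for j, base in enumerate(seq.upper()):
            k = index.get(base)
            if k is not None:  # skip gaps and ambiguous bases
                encoded[i, j, k] = 1.0
    return encoded


# Hypothetical toy alignment, standing in for real aligner output.
toy_alignment = ["ACGT-", "ACNTA", "TCGTA"]
X = one_hot_encode(toy_alignment)
print(X.shape)
```

The resulting array has one channel per nucleotide, so it can be passed to a convolutional layer or flattened for a dense network. Compared with mapping A/C/G/T to 0 to 3, this avoids implying an ordering between bases.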