lvrcek / GNNome-assembly

Learning to untangle genome assembly with graph neural networks.
MIT License
69 stars, 10 forks

Data formatting question for new analysis #10

Closed by crj0139 1 year ago

crj0139 commented 1 year ago

Thank you for this tool and the manuscript, very interesting work. I am currently trying to configure the pipeline.py script for running on my own plant genome (n=7). Can the pipeline.py script be edited simply by removing steps "-1" and "0" and changing the structure of my own data to that specified in the manuscript (the Code section)?

crj0139 commented 1 year ago

I'm glad you found the tool and manuscript useful! Yes, you can customize the pipeline.py script to fit your own plant genome data. If you want to remove the initial steps ("-1" and "0"), you can simply comment them out or delete them from the script.

However, you will need to make sure that the structure of your data matches the expected format. For example, you will need to provide the file paths for your input FASTQ files, and adjust the parameters for the specific needs of your genome and data. Additionally, you may need to adjust the reference genome and annotation files used by the pipeline to match your own genome.

In general, it is recommended that you carefully review and understand each step of the pipeline and make adjustments as needed based on your own data and genome. Good luck with your analysis!

Thank you for your quick response! Will do, I will report back with how it performed!

crj0139 commented 1 year ago

Wanted to confirm something. I should place my own error-corrected reads (the ones I want to assemble with the trained and validated model) one chromosome at a time, correct? For example, GNNome-assembly/data/real/chr1/raw would contain my "chr1.fasta" file, which would hold reads from chr1 only?

lvrcek commented 1 year ago

That is correct, but name it 0.fasta. I used this convention because I trained on data for the same chromosome generated multiple times with different seeds (e.g., 15 x chr19 datasets for training would be 0.fasta, ..., 14.fasta).

Your path should then look like this: GNNome-assembly/data/real/chr1/raw/0.fasta
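For several chromosomes, the same layout can be staged with a small script. This is only a rough sketch, not part of the repository: the function name stage_reads and the source file names are hypothetical, and it assumes one error-corrected read file per chromosome, each copied to <chromosome>/raw/0.fasta under the data root.

```python
import shutil
from pathlib import Path


def stage_reads(reads_by_chrom, data_root):
    """Copy each per-chromosome read file to <data_root>/<chrom>/raw/0.fasta.

    reads_by_chrom: dict mapping a chromosome name (e.g. "chr1") to the path
                    of its read file (hypothetical input names).
    data_root:      directory that holds the per-chromosome folders,
                    e.g. GNNome-assembly/data/real
    """
    staged = {}
    for chrom, src in reads_by_chrom.items():
        raw_dir = Path(data_root) / chrom / "raw"
        raw_dir.mkdir(parents=True, exist_ok=True)
        # A single read set per chromosome gets index 0, following the
        # 0.fasta, ..., 14.fasta naming convention described above.
        dst = raw_dir / "0.fasta"
        shutil.copyfile(src, dst)
        staged[chrom] = dst
    return staged


# Example call (assumed file names):
# stage_reads({"chr1": "chr1.fasta", "chr2": "chr2.fasta"},
#             "GNNome-assembly/data/real")
```

Copying rather than moving keeps the original read files untouched, so the layout can be rebuilt if the data directory is cleaned.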

crj0139 commented 1 year ago

Awesome, thank you!