tanjimin / C.Origami

C.Origami, a prediction and screening framework for cell type-specific 3D chromatin structure.

Issues with Training the C.Origami Model Using Only Sequence Data and Integrating Multi-Species Data #48

Open hanshandong2024 opened 1 month ago

hanshandong2024 commented 1 month ago

Dear Author,

Thank you for developing such an excellent model as C.Origami. It is great work, but I have encountered some difficulties.

First, I have DNA sequence information from other species, but no corresponding ATAC-seq or ChIP-seq data. I would like to try training the model and making predictions using only the sequence data. Could you please advise me on how to modify the code to retrain the model?

Second, the sequences of my target species are relatively short, possibly orders of magnitude shorter than those of human and mouse. This might lead to poor training results due to insufficient training data. I would like to expand the training data by using sequences from multiple species, each paired with its own three-dimensional structure data. I noticed that the training data is input by chromosome. Is it possible to input sequences from multiple species together with their corresponding three-dimensional structures?

Thank you for your assistance.

tanjimin commented 1 month ago

Hi @hanshandong2024, these are all doable and make sense.

  1. Training using only sequence: You can remove the ATAC and CTCF inputs and the corresponding encoder, leaving just the sequence encoder. Then edit the output dimension of the sequence encoder so that it matches the transformer input size, and connect it directly to the transformer.

  2. Multi-species training: I haven't tried this, but you could theoretically treat each species as a "chromosome" to increase the training data size. One pitfall, however, is that these species might follow different principles of genome organization, so your model could end up learning an average of these rules, resulting in blurry results.
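
As a rough illustration of point 2, the sketch below enumerates training windows across several species by giving every (species, chromosome) pair a species-qualified name, so that each one can be handled like a separate "chromosome". It is not taken from the C.Origami code base: the `Window` class, the `iter_windows` helper, the half-window step, and the chromosome lengths are all hypothetical. Only the 2,097,152 bp window size comes from the tool's help text.

```python
from dataclasses import dataclass
from typing import Dict, Iterator

WINDOW = 2_097_152  # input window size quoted in the corigami-predict help text


@dataclass(frozen=True)
class Window:
    """One training window (hypothetical helper type, not part of C.Origami)."""
    species: str
    chrom: str
    start: int
    end: int

    @property
    def key(self) -> str:
        # Species-qualified name, so each pair behaves like its own "chromosome".
        return f"{self.species}:{self.chrom}"


def iter_windows(
    genomes: Dict[str, Dict[str, int]],
    window: int = WINDOW,
    step: int = WINDOW // 2,  # assumed overlap between windows, chosen for illustration
) -> Iterator[Window]:
    """Yield fixed-size windows over every chromosome of every species.

    Chromosomes shorter than one window produce no windows at all.
    """
    for species, chrom_sizes in sorted(genomes.items()):
        for chrom, length in sorted(chrom_sizes.items()):
            for start in range(0, length - window + 1, step):
                yield Window(species, chrom, start, start + window)


# Hypothetical chromosome lengths in bp, for illustration only.
genomes = {
    "speciesA": {"chr1": 5_000_000, "chr2": 1_000_000},
    "speciesB": {"chr1": 4_194_304},
}
windows = list(iter_windows(genomes))
for w in windows:
    print(w.key, w.start, w.end)
```

In this sketch `speciesA:chr2` contributes nothing because it is shorter than a single window. With genomes much shorter than human or mouse, the window size would therefore probably need to be reduced along with the model input size.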

hanshandong2024 commented 1 month ago

Thank you for your quick reply. When using the model I trained with my data for other tasks such as Prediction and Editing/Perturbation, do I only need to input sequence data?

Below is the help documentation for the Prediction task.

Usage:
corigami-predict [options] 

Options:
-h --help       Show this screen.
--out           Output path for storing results
--celltype      Sample cell type for prediction, used for output separation
--chr           Chromosome for prediction
--start         Starting point for prediction (width defaults to 2097152 bp which is the input window size)
--model         Path to the model checkpoint
--seq           Path to the folder where the sequence .fa.gz files are stored
--ctcf          Path to the folder where the CTCF ChIP-seq .bw files are stored
--atac          Path to the folder where the ATAC-seq .bw files are stored

tanjimin commented 1 month ago

Yes, you only need the sequence data. Also, since you are changing a lot of things, I would suggest editing and running the prediction file directly instead of using the CLI. This file:

https://github.com/tanjimin/C.Origami/blob/main/src/corigami/inference/prediction.py
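
For orientation, here is a minimal, dependency-free sketch of preparing a sequence-only input window from a `.fa.gz` file. It does not reproduce the code in `prediction.py`. The function names, the assumption of one FASTA record per file, and the `A, T, C, G, N` channel order are illustrative assumptions that should be checked against the encoding used when the model was trained.

```python
import gzip
from typing import List

WINDOW = 2_097_152  # default prediction width from the corigami-predict help text

# Assumed channel order; verify against the encoding used during training.
CHANNELS = "ATCGN"


def read_fasta_gz(path: str) -> str:
    """Return the sequence of a single-record .fa.gz file as one upper-case string."""
    with gzip.open(path, "rt") as handle:
        return "".join(
            line.strip() for line in handle if not line.startswith(">")
        ).upper()


def one_hot(seq: str) -> List[List[int]]:
    """One-hot encode a sequence; any symbol other than A/T/C/G goes to the N channel."""
    index = {base: i for i, base in enumerate(CHANNELS)}
    n_channel = index["N"]
    rows = []
    for base in seq:
        row = [0] * len(CHANNELS)
        row[index.get(base, n_channel)] = 1
        rows.append(row)
    return rows


def sequence_window(path: str, start: int, width: int = WINDOW) -> List[List[int]]:
    """Encode seq[start:start + width]; raise if the window leaves the sequence."""
    seq = read_fasta_gz(path)
    if start < 0 or start + width > len(seq):
        raise ValueError(
            f"window {start}-{start + width} exceeds sequence length {len(seq)}"
        )
    return one_hot(seq[start:start + width])
```

The result has one row per base and five columns, with no CTCF or ATAC channels. For a full 2,097,152 bp window an array library would be more practical than nested lists; plain lists are used here only to keep the sketch free of dependencies.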

hanshandong2024 commented 1 month ago

Thank you very much for your guidance.