shirondru / enformer_fine_tuning

For fine-tuning Enformer using paired WGS & gene expression data

Questions about personal genome training #1

Closed HelloWorldLTY closed 3 weeks ago

HelloWorldLTY commented 3 weeks ago

Hi, thanks for your interesting work. When processing a personal genome, I wonder how to handle the diploid nature of the human genome. I notice that some methods double the DNA sequence length so that information from both chromosome copies is used for prediction. Do you also use the allele information for prediction? Thanks a lot.

shirondru commented 3 weeks ago

Thanks for your interest! As you say, there are many possible ways to handle this. We chose to do so by one-hot encoding each of the two sequences and using the average of the two matrices as our input.

Can you clarify what you mean by your second question? The two sequences that we average together as input are sequences that include the individual's genetic variants. Hence, the input reflects the allele information, but I think I am misunderstanding your question.

HelloWorldLTY commented 3 weeks ago

Thanks for your answers! It is really helpful!

    def _one_hot_encode_diploid(self, seq1, seq2):
        """
        Returns a single one-hot encoded sequence from an unphased diploid genome, to pass as input into the model.
        It does this by taking the two haplotypes of a diploid genome, one-hot encoding each, and taking the average.
        Heterozygous positions are therefore encoded as 0.5s.
        This is only appropriate for unphased sequences without indels. It should be changed for phased genomes,
        since heterozygous SNPs can be mapped to one haplotype or the other in that case.
        """
        one_hot_seq1 = self._one_hot_encode(seq1)
        one_hot_seq2 = self._one_hot_encode(seq2)
        return (one_hot_seq1 + one_hot_seq2) / 2
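The method above depends on `self._one_hot_encode`, which is not shown in this thread. To see what the averaging produces, here is a minimal self-contained sketch using NumPy. The A/C/G/T channel order, the all-zero encoding for other characters such as `N`, and the standalone function names are assumptions for illustration and may differ from the repository's encoder.

```python
import numpy as np

# Assumed channel order; the repository's own encoder may use a different one.
BASES = "ACGT"


def one_hot_encode(seq: str) -> np.ndarray:
    """Return a (len(seq), 4) float32 array; characters outside ACGT become all-zero rows."""
    encoded = np.zeros((len(seq), len(BASES)), dtype=np.float32)
    for position, base in enumerate(seq.upper()):
        channel = BASES.find(base)
        if channel >= 0:
            encoded[position, channel] = 1.0
    return encoded


def one_hot_encode_diploid(seq1: str, seq2: str) -> np.ndarray:
    """Average the one-hot encodings of two equal-length haplotype sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("haplotype sequences must have equal length (no indels)")
    return (one_hot_encode(seq1) + one_hot_encode(seq2)) / 2


# The two sequences differ only at index 2 (G vs. T), a heterozygous site.
encoded = one_hot_encode_diploid("ACGT", "ACTT")
print(encoded[2])  # the G and T channels each hold 0.5
```

Because the result is an average, it does not matter which of the two sequences carries the alternate allele at a heterozygous site. This is why the approach suits unphased genotypes.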

I tried to use the consensus method to get the sequence information for each individual (from the CRAM file), but I cannot find the diploid information in the resulting .fa file. I think I should try your VCF design. Thanks a lot.
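As a rough illustration of the VCF-based route, the sketch below builds two haplotype sequences from a reference string and a list of SNP genotypes. It is not the repository's pipeline. The function name `apply_snps`, the tuple layout, and the use of 0-based positions (VCF files use 1-based positions) are all assumptions. A real workflow would read genotypes from the VCF with a proper parser.

```python
def apply_snps(reference, snps):
    """Build two haplotype sequences from a reference and SNP genotypes.

    snps is an iterable of (pos0, ref, alt, genotype) tuples, where pos0 is a
    0-based position (an assumption of this sketch) and genotype is a string
    such as "0/1" or "1|1". Missing alleles (".") are left as the reference base.
    """
    hap1 = list(reference)
    hap2 = list(reference)
    for pos0, ref, alt, genotype in snps:
        if len(ref) != 1 or len(alt) != 1:
            raise ValueError("this sketch handles biallelic SNPs only, not indels")
        if reference[pos0] != ref:
            raise ValueError(f"reference mismatch at position {pos0}")
        alleles = genotype.replace("|", "/").split("/")
        # For unphased genotypes the assignment of the alternate allele to
        # hap1 or hap2 is arbitrary, which is harmless once the two
        # encodings are averaged.
        for haplotype, allele in zip((hap1, hap2), alleles):
            if allele == "1":
                haplotype[pos0] = alt
    return "".join(hap1), "".join(hap2)


seq1, seq2 = apply_snps(
    "ACGTAC",
    [
        (2, "G", "T", "0/1"),  # heterozygous: only one haplotype changes
        (4, "A", "C", "1/1"),  # homozygous alternate: both haplotypes change
    ],
)
print(seq1, seq2)  # ACGTCC ACTTCC
```

The two returned strings have the same length as the reference, so they can be passed directly as `seq1` and `seq2` to an averaging encoder like `_one_hot_encode_diploid`.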