lcmmichielsen / scXpresso

Predicting cell population-specific gene expression levels
MIT License

The one-hot encoding (h5py file ['promoter']) #3

houruiyan closed 10 months ago

houruiyan commented 10 months ago

Hi Lieke,

I am sorry to bother you again. Following the information you provided, I downloaded the canonical transcript for each gene and used it as the gene's identifier to fetch the sequence. However, I cannot reproduce the same one-hot encoded values as yours, and I do not know why.

For example, in the human data, take SAMD11 (ENSG00000187634). Its transcript ID is ENST00000616016. The BED file entry for ENST00000616016 is shown in the attached image.

However, when I converted the one-hot values back to a sequence and aligned it to the reference genome, I found that the TSS position is 925741, which is not consistent with the canonical TSS position. I do not know the reason; there may be an error in my code.

Could you upload the code and files that were used to produce the one-hot file?

I also wonder why the package does not ask users to provide a FASTA file and a BED file, and then extract the sequences and one-hot encode them for the users. With that functionality, the package could be used for any species.
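To make the suggestion concrete, here is a minimal sketch of extracting a region and one-hot encoding it. It is not code from scXpresso. It assumes the genome is already loaded as a dict of chromosome name to sequence string and that coordinates are 0-based, half-open as in BED; the helper names `extract` and `one_hot` and the toy chromosome `chrToy` are hypothetical.

```python
import numpy as np

# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def extract(genome, chrom, start, end, strand):
    """Fetch genome[chrom][start:end] (BED-style, 0-based half-open).

    Reverse-strand features are returned reverse-complemented so the
    sequence always reads 5' to 3' on the feature's own strand.
    """
    seq = genome[chrom][start:end]
    return reverse_complement(seq) if strand == "-" else seq


def one_hot(seq, channels="ACGT"):
    """Encode a sequence as a (length, 4) array; unknown bases stay all-zero."""
    lookup = {base: i for i, base in enumerate(channels)}
    encoded = np.zeros((len(seq), len(channels)), dtype=np.uint8)
    for pos, base in enumerate(seq.upper()):
        if base in lookup:
            encoded[pos, lookup[base]] = 1
    return encoded


# Toy genome, made up for illustration only.
genome = {"chrToy": "AACCGGTT"}
forward = extract(genome, "chrToy", 1, 5, "+")
reverse = extract(genome, "chrToy", 1, 5, "-")
encoded = one_hot("ACGN")
```

The `channels` argument is exposed because the column order of a one-hot array is a convention, and two encodings of the same sequence only match when both sides use the same order.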

Hope to get your help. Thank you very much!

Best regards, Ruiyan

lcmmichielsen commented 10 months ago

Hi!

I think it has to do with the version of the annotation file. As mentioned in the methods of the preprint, we use annotation file v22 (downloaded here: https://www.gencodegenes.org/human/release_22.html). This is the same version that the original paper used, according to its methods. There are newer versions already, so this could cause the difference. According to the canonical transcripts I used, ENST00000616016 is indeed the canonical transcript for SAMD11, but its location in my annotation file is simply different.
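One way to check which coordinates a given annotation release assigns to a transcript is to look it up in the GTF directly. The sketch below is not taken from the repository. It assumes the standard 9-column, tab-separated GTF layout with a `transcript_id "..."` attribute, and it strips a trailing version suffix before comparing IDs. The function name `find_transcript` and the toy lines (including their coordinates) are made up for illustration.

```python
import re


def find_transcript(gtf_lines, transcript_id):
    """Return chrom/start/end/strand for a transcript, or None if absent.

    Coordinates are returned as stored in the GTF (1-based, inclusive).
    IDs are compared without their version suffix ("T1.2" matches "T1").
    """
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # only 'transcript' features carry the span we want
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match and match.group(1).split(".")[0] == transcript_id:
            return {
                "chrom": fields[0],
                "start": int(fields[3]),
                "end": int(fields[4]),
                "strand": fields[6],
            }
    return None


# Toy annotation lines; these are NOT real GENCODE records.
toy_gtf = [
    "##description: toy annotation\n",
    'chrToy\tTOY\tgene\t100\t900\t.\t+\t.\tgene_id "G1.1";\n',
    'chrToy\tTOY\ttranscript\t120\t900\t.\t+\t.\tgene_id "G1.1"; transcript_id "T1.2";\n',
]
record = find_transcript(toy_gtf, "T1")
```

Running the same lookup against two releases (by passing an open file handle for each GTF) would show whether the transcript's start and end differ between them.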

I created a folder in the 'tutorials' folder with some notebooks and Python code to reproduce the sequences and half-life times I used. Note that this code is definitely not the most efficient, but it works.

Maybe also an important thing to mention when comparing the one-hot encodings: the channels of my one-hot encoding represent 'ACTG' instead of the more straightforward 'ACGT', which does not influence the results, but is just good to know when comparing.
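For anyone comparing encodings, the channel order can be handled as in the sketch below. This is an illustration, not the repository's code: it assumes the one-hot array has shape (length, 4) with one row per base, which may differ from the layout stored in the h5py file, and the helper names `decode_onehot` and `reorder_channels` are hypothetical.

```python
import numpy as np


def decode_onehot(onehot, channels="ACTG"):
    """Turn a (length, 4) one-hot array into a string.

    Rows with no channel set are decoded as 'N'.
    """
    onehot = np.asarray(onehot)
    letters = np.array(list(channels) + ["N"])
    # Index 4 selects 'N' for all-zero rows; otherwise take the set channel.
    idx = np.where(onehot.sum(axis=1) > 0, onehot.argmax(axis=1), 4)
    return "".join(letters[idx])


def reorder_channels(onehot, src="ACTG", dst="ACGT"):
    """Permute the channel axis from the src order to the dst order."""
    perm = [src.index(base) for base in dst]
    return np.asarray(onehot)[:, perm]


example = np.array([
    [1, 0, 0, 0],  # A
    [0, 1, 0, 0],  # C
    [0, 0, 1, 0],  # T (third channel in 'ACTG' order)
    [0, 0, 0, 1],  # G (fourth channel in 'ACTG' order)
    [0, 0, 0, 0],  # no base set
])
seq = decode_onehot(example, channels="ACTG")
as_acgt = reorder_channels(example, src="ACTG", dst="ACGT")
```

Decoding `as_acgt` with `channels="ACGT"` gives the same string as decoding `example` with `channels="ACTG"`, which is a quick sanity check before comparing two files.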

Good for you to know as well: I will be on holiday from 26 August to 18 September, so will not be responding to any issues during that time, but happy to help any other time!

houruiyan commented 10 months ago

Thank you very much for your detailed reply! I will look into the code in the tutorials folder.

Thank you also for letting me know about your holiday.

houruiyan commented 10 months ago

Thank you! I used that GTF version and got the same result. I had also forgotten to use the end coordinate for genes located on the reverse strand. Thank you very much for your help!
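The reverse-strand point can be sketched as follows: for a transcript on the '+' strand the TSS is its start coordinate, while on the '-' strand it is its end coordinate. This is a minimal illustration assuming GTF-style 1-based inclusive coordinates; the function names and the `upstream`/`downstream` parameters are hypothetical, and the numbers in the example are made up.

```python
def tss_position(start, end, strand):
    """Return the TSS of a transcript given GTF-style start/end and strand."""
    if strand == "+":
        return start
    if strand == "-":
        return end  # transcription runs from the higher coordinate downwards
    raise ValueError(f"unknown strand: {strand!r}")


def promoter_window(start, end, strand, upstream, downstream):
    """Return (low, high) genomic coordinates of a window around the TSS.

    'upstream' and 'downstream' are measured relative to the direction of
    transcription, so they swap sides on the reverse strand.
    """
    tss = tss_position(start, end, strand)
    if strand == "+":
        return tss - upstream, tss + downstream
    return tss - downstream, tss + upstream


# Made-up coordinates for illustration only.
plus_window = promoter_window(1000, 5000, "+", upstream=100, downstream=50)
minus_window = promoter_window(1000, 5000, "-", upstream=100, downstream=50)
```

A sequence extracted for a reverse-strand window would also need to be reverse-complemented so that it reads in the direction of transcription.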