Feat: Feature gene annotation

grantn5 commented 3 weeks ago

I have written some code that will produce a per gene CNV matrix, giving the mean log2 CNV per gene (as 1 gene can appear in many windows) I have also updated the np.convolve mode to valid so it matches the documentation in the function. Full methods here

I have also updated the code in the instance where the window is larger than the number of genes on the chromosome, to only run one convolution the length of the genes on the given chr see https://github.com/grantn5/infercnvpy/blob/feature_geneAnnotation/src/infercnvpy/tl/_infercnv.py#L198

I have also updated the genomic_position_from_gtf function so it will always read the gtf in as a pandas df (avoids bug where it reads it in as polars df)

gtf = gtfparse.read_gtf(
        gtf_file, usecols=["seqname", "feature", "start", "end", "gene_id", "gene_name"], result_type="pandas"
    )

Closes #128

grantn5 commented 3 weeks ago

Hi @grst I am not familiar with this building of docs. It is failing due to a warning because the sphinx can't seem to find scanpy in the build environment, any suggestions?

grst commented 3 weeks ago

It seems the pca function got moved within scanpy. I fixed it in another PR. Once you resolved the merge conflicts, readthedocs should build again.

I'll then do a proper review.

grantn5 commented 3 weeks ago

Hi @grst I have pulled your changes and have reverted genomic_position_from_gtf. This is now ready fro review!

grantn5 commented 2 weeks ago

Hi @grst It looks like all tests that use the oligodendroglioma adata are failing? Has this object been moved in scanpy?

grst commented 2 weeks ago

I think this is rather an incompatibility with numpy 2.0 which was very recently released. I'll look into it, but the issue might be in one of the dependencies.

grantn5 commented 2 weeks ago

Just some minor comments. Thanks for adding test cases :)

Have you compared the runtime before and after adding your code? It looks like it could make the whole workflow significantly slower and memory intense. In that case it would be great if it could be disabled via a function parameter in tl.infercnvpy.

Hi thanks for the review!

I have tested the time on our in-house data ~50,000 cells, ~20,000 genes and it runs within about 10 minutes parallelized over 32 cpu's (before it took ~2 mins). However due to the nested for loop you have to massively reduce the chunksize ~100 as each sample needs to be looped through the flattened index of genes for each window https://github.com/grantn5/infercnvpy/blob/feature_geneAnnotation/src/infercnvpy/tl/_infercnv.py#L230 which will be very long.

I also found that there is a bug in n_jobs where it isn't automatically parallelized over all available cpus's so I have always been setting it manually.

This is definitely a CPU bound task rather than memory so if you are using a machine with lots of CPUs it will speed up immensely.

If required for the PR to be accepted I can do further refactoring to add a parameter to turn off per gene CNV calling by default. However, this may take some time due to the way the function is handling outputs from intermediate functions.

grst commented 2 weeks ago

Thanks for testing this. Given that it's a 5x runtime increase and scalability is one of the main selling points of this library, I would like to be able to switch this off. For 50k cells it might not be much of an issue, but when working with atlases >1M cells, this makes quite a difference.

grantn5 commented 1 week ago

Hi @grst I have made the calculation of per gene CNV optional, default set to False, I have also added some documentation to the function.

Additionally, I have added some benchmarking to the pytests to show the performance of the added computation. Hopefully this is now good merge!

icbi-lab / infercnvpy

Feat: Feature gene annotation #132