sjteresi / TE_Density

Python script calculating transposable element density for all genes in a genome. Publication: https://mobilednajournal.biomedcentral.com/articles/10.1186/s13100-022-00264-4
GNU General Public License v3.0

Add check for non-unique genes, which causes overlap broadcast error #147

Closed sjteresi closed 3 months ago

sjteresi commented 3 months ago

Addresses #124

Problem:

Users would hit a cryptic error during the overlap calculation stage. The message said that operands could not be broadcast together because of mismatched shapes. The error appeared when the input data had not been quality-controlled and contained duplicate gene names. A duplicated gene name causes the corresponding GeneDatum instance to have more than one start or stop value, which breaks the overlap calculation function.
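For reviewers who want to see the failure mode in isolation, here is a minimal sketch. It does not use TE_Density code: the column names `Start` and `Stop`, the gene names, and the `te_starts` array are made up for illustration, and `np.maximum` merely stands in for whatever arithmetic the real overlap function performs. It only shows how a duplicated index label turns a single value into several values and then trips a broadcast error.

```python
import numpy as np
import pandas as pd

# Hypothetical gene table; "gene_a" appears twice in the index.
genes = pd.DataFrame(
    {"Start": [100, 500, 900], "Stop": [200, 600, 950]},
    index=pd.Index(["gene_a", "gene_b", "gene_a"], name="Gene_Name"),
)

# Hypothetical TE start positions (four elements).
te_starts = np.array([50, 150, 300, 700])

# A unique gene name yields one start value...
unique_start = np.asarray(genes.loc["gene_b", "Start"])
# ...but a duplicated gene name yields one value per duplicate row.
dup_start = np.asarray(genes.loc["gene_a", "Start"])

# One value broadcasts against the TE array without trouble.
ok_result = np.maximum(unique_start, te_starts)

# Two values against four TE values cannot be broadcast.
try:
    np.maximum(dup_start, te_starts)
    broadcast_failed = False
    message = ""
except ValueError as err:
    broadcast_failed = True
    message = str(err)
```

Here `message` holds the same kind of "operands could not be broadcast together" text that users reported, with nothing in it pointing at the duplicated gene name as the cause.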

Fix:

Add a check when the preprocessed data is read. The check runs at the pandas DataFrame stage, before the data is wrapped as GeneData or GeneDatum, and raises an error if any index value (a gene name) is non-unique.
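A check of this kind might look roughly like the sketch below. The function name `check_unique_gene_names`, the exception type, and the error wording are assumptions for illustration and are not taken from the actual diff; the only pandas call relied on is `Index.duplicated`.

```python
import pandas as pd


def check_unique_gene_names(gene_frame: pd.DataFrame) -> None:
    """Raise ValueError if any index label (gene name) occurs more than once.

    Hypothetical helper, not the function added in this PR.
    """
    # keep=False marks every occurrence of a repeated label, not just the later ones.
    mask = gene_frame.index.duplicated(keep=False)
    duplicated = gene_frame.index[mask].unique()
    if len(duplicated) > 0:
        raise ValueError(
            "Gene names must be unique; duplicates found: "
            f"{list(duplicated)}"
        )


# Usage with made-up data: the first frame passes, the second is rejected.
clean = pd.DataFrame({"Start": [100, 500]}, index=["gene_a", "gene_b"])
check_unique_gene_names(clean)

dirty = pd.DataFrame({"Start": [100, 500, 900]}, index=["gene_a", "gene_b", "gene_a"])
try:
    check_unique_gene_names(dirty)
    rejected = False
    reason = ""
except ValueError as err:
    rejected = True
    reason = str(err)
```

Failing at this stage lets the error message name the offending genes, instead of surfacing later as a shape mismatch.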