sjteresi / TE_Density

Python script calculating transposable element density for all genes in a genome. Publication: https://mobilednajournal.biomedcentral.com/articles/10.1186/s13100-022-00264-4
GNU General Public License v3.0
28 stars 4 forks source link

Annotation revision #41

Closed sjteresi closed 3 years ago

sjteresi commented 3 years ago

[X] Revised Annotation [X] fake | subset annotation data? [ ] Provide documentation about ulimit [X] Revise on Order basis [X] Revise on SuperFamily basis [X] Revise on all TEs. Merge every single TE regardless of identity. [ ] viola/jones annotation implementation? (later)

Can you help me figure out how to make sure that the user has the ulimit -s unlimited? You mentioned something about putting it in the makefile...? This is now necessary to run density.py because you have to create a revised annotation if it doesn't already exist.

The annotation revision takes a really long time. About 4 hours for Camarosa's updated TE library, probably really inefficient due to recursion. So in a future release I'd like to visit the topic of making it better and not reliant upon makingulimit -s unlimited.

sjteresi commented 3 years ago

I figured out how to put ulimit -s unlimited in the Makefile. I didn't quite understand the makefile's usage or purpose in general (never really interacted with altering Makefiles before) but studied it up and was able to include the ulimit thing in my branch where I am working on the changes for your refactoring steps.

So in summary I got that working, but we should still talk about the performance of the code.

teresi commented 3 years ago

you should probably just do something like this rather than the makefile

[ins]▸>>> import os
[ins]▸>>> os.system("ulimit -a")
time(seconds)        unlimited
file(blocks)         unlimited
data(kbytes)         unlimited
stack(kbytes)        8192
coredump(blocks)     0
memory(kbytes)       unlimited
locked memory(kbytes) 16384
process              127873
nofiles              1024
vmemory(kbytes)      unlimited
locks                unlimited
rtprio               0

that's more clear and doesn't require one to use the makefile which is really just for convenience, not really canonical

teresi commented 3 years ago

can you squash and rebase?

did you forget to add your Makefile change?

have you considered the os.system solution above?

have you considered using the Makefile to update any of these gene files you need? that is what Make is really for, making files

sjteresi commented 3 years ago

can you squash and rebase?

did you forget to add your Makefile change?

have you considered the os.system solution above?

have you considered using the Makefile to update any of these gene files you need? that is what Make is really for, making files

The os.system method would not work for modifying the stack size, however I did find the proper method for modifying the stack size through the use of the resource module. I placed the appropriate code for that in the init method of Revise_Anno.

I will explore the idea of extracting out the Revise_Anno class and separating that out from the normal operation of the code. Since it only needs to be run once it may be a good idea to place the execution of that separate from running the normal code. But as it stands now, it only generates it if it needs it, but the end-user may want to be able to run that portion purely separately.

Thank you, the commit messages were useful. I also found this useful video on Youtube.

teresi commented 3 years ago

overall I think you have some good additions here you have documentation and are precise in your naming

you have merge conflicts so I will leave it at some feedback:

changes to

    def concat_single_chrom(self, chromosome):
        """Mutate the complete chromosome with concatenated dataframes for the given chromosome.

        Loop over the dictionary of pandadataframes that are on a type-by-type
        basis and concatenate them all into the new transposon annotation.

        Args:
            chromosome (str): String representing the current chromosome
        """