nlp-uoregon / trankit

Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing
Apache License 2.0
724 stars · 99 forks

Memory leaks, CUDA errors, and the state of this project #67

Open lemontheme opened 1 year ago

lemontheme commented 1 year ago

Based on qualitative analysis, I concluded that Trankit's dependency parser is excellent compared to alternatives (e.g. spaCy, Stanza, UDPipe), and I set about incorporating Trankit into a highly parallelized job to create a new treebank of Dutch. In the testing phase, everything worked smoothly, including on GPUs. As soon as I began to scale up to more data, asynchronous CUDA out-of-memory errors started to appear.

I've tried limiting the size of texts, calling `torch.cuda.empty_cache()`, wrapping things in retry logic with a backoff interval... Every time, CUDA memory eventually ran out. In the end I gave up on trying to get it to run reliably on GPU. My job has been running, albeit much more slowly, on CPU for a week now. I'm noticing that the remaining system memory keeps shrinking, so here, too, there appears to be some sort of memory leak.
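For reference, the retry-with-backoff mitigation I described looks roughly like the sketch below. The `process_with_backoff` helper and the flaky-call shape are my own illustration, not Trankit API; in real use, `fn` would wrap a pipeline call and `cleanup` would be `torch.cuda.empty_cache`:

```python
import time


def process_with_backoff(fn, doc, retries=3, base_delay=1.0, cleanup=None):
    """Call fn(doc), retrying on CUDA OOM-style RuntimeErrors.

    Between attempts, run an optional cleanup hook (e.g. a hypothetical
    cleanup=torch.cuda.empty_cache) and sleep with exponential backoff.
    Re-raises immediately on non-OOM errors or once retries are exhausted.
    """
    for attempt in range(retries):
        try:
            return fn(doc)
        except RuntimeError as e:
            # Only treat out-of-memory errors as retryable.
            if "out of memory" not in str(e).lower() or attempt == retries - 1:
                raise
            if cleanup is not None:
                cleanup()  # free cached allocator blocks before retrying
            time.sleep(base_delay * 2 ** attempt)
```

Even with this wrapping, memory eventually ran out, which is why I suspect the leak is inside the pipeline rather than in how it is called.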

I'm noticing that most issues here have been going unanswered. My guess is that the clever person who originally worked on this has moved on to other things. I'd just like to say that I personally find it regrettable that a project with such great model accuracy has apparently fizzled out. (I realize this might sound somewhat entitled.)

I'll be analyzing the code to see if I can locate the source of the memory issues. For now, I propose that the docs add a caveat about these real-world performance issues.
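For the CPU-side leak hunt, one stdlib approach that is independent of Trankit internals is to diff `tracemalloc` snapshots across batches and see which allocation sites keep growing. The growing list below just simulates a leak; in practice the loop body would be repeated pipeline calls:

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

# Simulated leak: in a real session this would be e.g. repeated
# pipeline(...) calls over a batch of documents.
leaked = []
for _ in range(1000):
    leaked.append(bytearray(1024))  # ~1 MB retained across "batches"

after = tracemalloc.take_snapshot()

# Allocation sites sorted by how much their footprint grew.
top_stats = after.compare_to(before, "lineno")
for stat in top_stats[:5]:
    print(stat)
```

A site whose `size_diff` climbs monotonically batch after batch is a strong leak candidate; for the GPU side, `torch.cuda.memory_allocated()` sampled the same way would serve a similar purpose.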