optimize spacy result parsing

vcastro commented 3 years ago

Really great package!

I started playing around on a large corpus and found the parsing of the spacy results to be slow. I've split out the parsing into a separate function and used lapply to iterate through each text. It seems to be working much faster now.

I also moved the progress bar to the main clinspacy function to measure progress across texts rather than entities.

kdpsingh commented 3 years ago

@vcastro This looks amazing. All of the outside loops are optimized (using data.table) but as you have figured out, the inside loop was quite slow (and the progress bar was originally linked to the inside loop).

Everything you've done is stuff I've been meaning to get to but haven't had the chance. Let me just give your changes a spin and quickly look through your source and will plan to commit if all works okay on my end.

kdpsingh commented 3 years ago

@vcastro this is great. The speedup is very noticeable and I love the updated functionality of the progress bar! There are a few issues that need to be addressed. I am happy to look into these later this week.

Empty matches create a data frame with 4 columns only. For example, clinspacy('hello') creates a data.frame with the columns id, cui, semantic_type, and definition. This is problematic for vectorized operations (more than one value in the character vector) because it will fail to rbind when the column names don't match.
Each output creates an extra empty row. Notice that clinspacy('hello') creates a result with 2 rows (instead of 1 or 0). Even for situations where an entity is matched, an extra row is created in my testing.
I love that you've added confidence and cui to all clinspacy() output (even when the linker is disabled) because it creates a clear expectation for the user about the columns contained in the output file. We may want to just keep the column order consistent. Currently, the column order differs when the linker is on vs. off.

kdpsingh commented 3 years ago

Closing this PR.

Thanks for giving me the idea! I ended up rewriting mainly to prevent some edge cases from breaking. I incorporated your two main ideas -- use of lapply() and moving the progress bar to the document level as opposed to the token level.

I fixed this issue in https://github.com/ML4LHS/clinspacy/commit/d6aff34828619005ec0f29042b8505f0a0de5fa6

kdpsingh / clinspacy

optimize spacy result parsing #8