I tried to figure out why this whole thing was slow, and it turns out the spelling quality operations are slowing it down. So far I have only checked the high-level analysis; the granular-level analysis is still remaining. I have modified the code a little bit like this:
```python
from tqdm import tqdm_notebook

spelling_quality_score_list = []
for sentence in tqdm_notebook(list(new_dataframe[text_column])):
    spelling_quality_score_list.append(spelling_quality_score(sentence))
new_dataframe['spelling_quality_score'] = spelling_quality_score_list
```
I am not sure if this is a very efficient method, but at least I got to know which line is taking a lot of time.
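As a side note, roughly the same change can be written with tqdm's pandas integration; this is just an illustrative alternative sketch, not what I actually ran:

```python
# Sketch of an equivalent form using tqdm's pandas integration
# (adds a progress bar directly to DataFrame/Series.apply).
from tqdm.notebook import tqdm

tqdm.pandas()
new_dataframe['spelling_quality_score'] = (
    new_dataframe[text_column].progress_apply(spelling_quality_score)
)
```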
I'm glad you have been able to resolve this temporarily for yourself.
It adds a progress bar, which is good, but I am not sure performance is actually addressed in this manner. For the high-level NLP features, it might need to be handled differently.
Any thoughts on how you would test for this change? How would the tests look?
While you think about it and share your thoughts, I will take a look at it and try to improve this aspect of the library.
Thanks for nudging me about it.
@strivedi02 I'm working on an implementation to improve this issue via this branch https://github.com/neomatrix369/nlp_profiler/tree/scale-when-applied-to-larger-datasets. If you can test this out separately, it would be cool. Also have a look at this conversation for more context: https://www.kaggle.com/viratkothari/nlp-profiler-profiling-of-textual-dataset/comments#1015859
So now that I have looked into this again, and also worked on my own implementation, your approach would help get it working with tqdm,
but speed improvements won't happen until we look at it from a parallelisation point of view!
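To illustrate the parallelisation point, here is a minimal sketch using joblib to fan the per-text scoring out across CPU cores; the helper name `parallel_scores` is made up for illustration and this is not the actual implementation in the scale-when-applied-to-larger-datasets branch:

```python
# A minimal parallelisation sketch (illustrative only; the actual branch may differ).
from joblib import Parallel, delayed

def parallel_scores(texts, score_function, n_jobs=-1):
    # Spread the expensive per-text scoring across all available CPU cores.
    return Parallel(n_jobs=n_jobs)(
        delayed(score_function)(text) for text in texts
    )

new_dataframe['spelling_quality_score'] = parallel_scores(
    list(new_dataframe[text_column]), spelling_quality_score
)
```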
Some metrics gathered during implementation of this feature, comparing before and after the implementation:
commit/branch | dataset (rows) | time taken | time (seconds) | speed-up (x times) | run by
---|---|---|---|---|---
master (~ 55c6347) | 7 | 6.82 seconds | 6.82 | baseline | Mani
master (~ 55c6347) | 100 | 211.2 seconds | 211.2 | baseline | Virat
master (~ 55c6347) | 210 | 1 min 19 s | 79 | baseline | Mani
master (~ 55c6347) | 500 | (TBC) | (TBC) | baseline | Virat
master (~ 55c6347) | 5,000 | (TBC) | (TBC) | baseline | Virat
master (~ 55c6347) | 10,240 | 26 minutes 2 seconds | 1562 | baseline | Shubam Trivedi
nlp_profiler.py on AI-ML-DL repo (~ bf601172) | 22,742 | 1 hour 24 mins | 5040 | baseline | Kurian
master (~ 55c6347) | 64,295 | ~4-6 hours | 21600 | baseline | Mani
scale-when-applied-to-larger-datasets (~ a411c13) | 7 | 7.42 seconds | 7 | -0.0879x | Mani
scale-when-applied-to-larger-datasets (~ 78eb810) | 210 | 39.2 seconds | 39.2 | 2x | Mani
scale-when-applied-to-larger-datasets (~ a411c13) | 500 | 455.3 seconds | 455.3 | no baseline yet | Virat
scale-when-applied-to-larger-datasets (~ a411c13) | 5,000 | (TBC) | (TBC) | no baseline yet | Virat
scale-when-applied-to-larger-datasets (~ a411c13) | 10,240 | 2 minutes 35 seconds | 95 | ~16.44x | Shubam Trivedi
scale-when-applied-to-larger-datasets (~ a411c13) | 22,742 | 4 min 37 s | 277 | ~18.19x | Kurian
scale-when-applied-to-larger-datasets (~ a411c13) | 64,295 | 16-23 minutes | 1380 | ~15.65x | Mani
@strivedi02 can you please share your metrics for the above (https://github.com/neomatrix369/nlp_profiler/issues/2#issuecomment-696675059) - please provide info for each and every column possible
Closed by PR #9
@neomatrix369 for me, when using the scale-when-applied-to-larger-datasets branch, it takes 4 minutes 37 seconds.
Output of `%%time`:

```
CPU times: user 42.1 s, sys: 747 ms, total: 42.8 s
Wall time: 4min 37s
```
@kurianbenoy Can you please provide the other before and after details, like the commit ids of the branch you used to install the library? It should not be hard to find out; if you look at the logs it should be there.
For the time above, the master branch was used and this was tested on Colab.
And with the same settings, the scale-when-applied-to-larger-datasets branch was used.
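For reproducibility, here is a minimal sketch of how such a benchmark could be set up on Colab; the import path and the `apply_text_profiling` call are assumptions based on the library's documented usage, and `dataset.csv` and the text column name are placeholders:

```python
# A minimal benchmarking sketch (assumed setup, not the exact notebooks used above).
# Install from a specific branch, e.g.:
#   pip install git+https://github.com/neomatrix369/nlp_profiler.git@scale-when-applied-to-larger-datasets
import time
import pandas as pd
from nlp_profiler.core import apply_text_profiling  # assumed import path

df = pd.read_csv('dataset.csv')                 # placeholder dataset
start = time.time()
profiled_df = apply_text_profiling(df, 'text')  # 'text' is a placeholder column name
print(f"Wall time: {time.time() - start:.1f} seconds")
```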
@strivedi02 @kurianbenoy 🙇 thanks both for the references; you can see the updated table of approximate speed-ups above.
@strivedi02 thanks for raising the initial discussion #1 and for the pointers about the different issues. This and other issues have been resolved (we still have pending ones, but that is fine) as a result of user/community feedback and interactions.
With regards to the performance of the library, it's an ongoing effort to keep in mind, but adding new NLP features would usually take precedence over such issues.
@neomatrix369 I always struggled to keep all my scripts in one place, or had to remember which code was where; now, thanks to you, we won't have to remember all that. Through this package a lot of things will become easier, and I think its usage by the community will keep growing.
That's really good to know; glad it helps everyone. It is also what I observed: everyone was using their own recipes. Now you can share, contribute to, and extend a central recipe.
@strivedi02 Does the library have most if not all of the things you use or one would need when dealing with text? I think there is room for a lot more.
Feel free to open issues/pull requests to extend the existing functionality with additional relevant features that are useful for NLP practitioners.
@loopyme I'll be happy to hear your feedback on the work done via this issue; please let me know how I can answer your questions and clarify any doubts.
I have tried to build this library from the ground up, paying attention to cohesive modules and the structure of the library as a whole.
At the moment the library runs slowly and takes a long time to handle large datasets due to the processing required per record. This could be optimised and improved in small steps so it can handle larger datasets.
Opened on the back of discussions in #1. Partially related to #3, although independent of that issue.