I tried to figure out why this whole thing was slow, and it turns out the spelling quality operations are slowing it down. So far I have only checked the high-level analysis; the granular-level analysis is still remaining. I have modified the code a little bit like this:
```python
from tqdm import tqdm_notebook

spelling_quality_score_list = []
for sentence in tqdm_notebook(list(new_dataframe[text_column])):
    spelling_quality_score_list.append(spelling_quality_score(sentence))
new_dataframe['spelling_quality_score'] = spelling_quality_score_list
```
I am not sure if this is a very efficient method, but at least I got to know which line is taking a lot of time.
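As a side note, roughly the same change can be written with tqdm's pandas integration; this is just an illustrative alternative sketch, not what I actually ran:

```python
# Sketch of an equivalent form using tqdm's pandas integration
# (adds a progress bar directly to DataFrame/Series.apply).
from tqdm.notebook import tqdm

tqdm.pandas()
new_dataframe['spelling_quality_score'] = (
    new_dataframe[text_column].progress_apply(spelling_quality_score)
)
```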
I'm glad you have been able to resolve this temporarily for yourself.
It adds a progress bar, which is good, but I am not sure performance is actually addressed in this manner. For the high-level NLP features, it might need to be handled differently.
Any thoughts on how you would test for this change? How would the tests look?
While you think about it and share your thoughts, I will take a look at it and try to improve this aspect of the library.
Thanks for nudging me about it.
@strivedi02 I'm working on an implementation to improve this issue via this branch https://github.com/neomatrix369/nlp_profiler/tree/scale-when-applied-to-larger-datasets. If you can test this out separately, it would be cool. Also have a look at this conversation for more context: https://www.kaggle.com/viratkothari/nlp-profiler-profiling-of-textual-dataset/comments#1015859
So now that I have looked into this again, and also worked on my own implementation, your approach would help get it working with tqdm,
but speed improvements won't happen until we look at it from a parallelisation point of view!
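To illustrate the parallelisation point, here is a minimal sketch using joblib to fan the per-text scoring out across CPU cores; the helper name `parallel_scores` is made up for illustration and this is not the actual implementation in the scale-when-applied-to-larger-datasets branch:

```python
# A minimal parallelisation sketch (illustrative only; the actual branch may differ).
from joblib import Parallel, delayed

def parallel_scores(texts, score_function, n_jobs=-1):
    # Spread the expensive per-text scoring across all available CPU cores.
    return Parallel(n_jobs=n_jobs)(
        delayed(score_function)(text) for text in texts
    )

new_dataframe['spelling_quality_score'] = parallel_scores(
    list(new_dataframe[text_column]), spelling_quality_score
)
```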
Some metrics gathered during implementation of this feature, comparing before and after the implementation:
commit/branch | dataset (rows) | time taken | time (seconds) | speed-up (x times) | run by
---|---|---|---|---|---
master (~ 55c6347) | 7 | 6.82 seconds | 6.82 | baseline | Mani
master (~ 55c6347) | 100 | 211.2 seconds | 211.2 | baseline | Virat
master (~ 55c6347) | 210 | 1 min 19 s | 79 | baseline | Mani
master (~ 55c6347) | 500 | (TBC) | (TBC) | baseline | Virat
master (~ 55c6347) | 5,000 | (TBC) | (TBC) | baseline | Virat
master (~ 55c6347) | 10,240 | 26 minutes 2 seconds | 1562 | baseline | Shubam Trivedi
nlp_profiler.py on AI-ML-DL repo (~ bf601172) | 22,742 | 1 hour 24 mins | 5040 | baseline | Kurian
master (~ 55c6347) | 64,295 | ~4-6 hours | 21600 | baseline | Mani
scale-when-applied-to-larger-datasets (~ a411c13) | 7 | 7.42 seconds | 7 | -0.0879x | Mani
scale-when-applied-to-larger-datasets (~ 78eb810) | 210 | 39.2 seconds | 39.2 | 2x | Mani
scale-when-applied-to-larger-datasets (~ a411c13) | 500 | 455.3 seconds | 455.3 | no baseline yet | Virat
scale-when-applied-to-larger-datasets (~ a411c13) | 5,000 | (TBC) | (TBC) | no baseline yet | Virat
scale-when-applied-to-larger-datasets (~ a411c13) | 10,240 | 2 minutes 35 seconds | 95 | ~16.44x | Shubam Trivedi
scale-when-applied-to-larger-datasets (~ a411c13) | 22,742 | 4 min 37 s | 277 | ~18.19x | Kurian
scale-when-applied-to-larger-datasets (~ a411c13) | 64,295 | 16-23 minutes | 1380 | ~15.65x | Mani
@strivedi02 can you please share your metrics for the above (https://github.com/neomatrix369/nlp_profiler/issues/2#issuecomment-696675059) - please provide info for each and every column possible
Closed by PR #9
@neomatrix369 for me, when using the scale-when-applied-to-larger-datasets branch, it takes 4 minutes 37 seconds.
Output of `%%time`:

```
CPU times: user 42.1 s, sys: 747 ms, total: 42.8 s
Wall time: 4min 37s
```
@kurianbenoy Can you please provide the other before and after details, like the commit ids of the branch you used to install the library? It should not be hard to find out; if you look at the logs it should be there.
For the time above, the master branch was used and this was tested on Colab.
And with the same settings, the scale-when-applied-to-larger-datasets branch was used.
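For reproducibility, here is a minimal sketch of how such a benchmark could be set up on Colab; the import path and the `apply_text_profiling` call are assumptions based on the library's documented usage, and `dataset.csv` and the text column name are placeholders:

```python
# A minimal benchmarking sketch (assumed setup, not the exact notebooks used above).
# Install from a specific branch, e.g.:
#   pip install git+https://github.com/neomatrix369/nlp_profiler.git@scale-when-applied-to-larger-datasets
import time
import pandas as pd
from nlp_profiler.core import apply_text_profiling  # assumed import path

df = pd.read_csv('dataset.csv')                 # placeholder dataset
start = time.time()
profiled_df = apply_text_profiling(df, 'text')  # 'text' is a placeholder column name
print(f"Wall time: {time.time() - start:.1f} seconds")
```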
@strivedi02 @kurianbenoy 🙇 thanks both for the references; you can see the updated table of approximate speed-ups above.
@strivedi02 thanks for raising the initial discussion #1 and for the pointers about the different issues. This and other issues have been resolved (we still have pending ones, but that is fine) as a result of user/community feedback and interactions.
With regards to the performance of the library, it's an ongoing effort to keep in mind, but adding new NLP features would usually take precedence over such issues.
@neomatrix369 I always struggled to keep all my scripts in one place, or had to remember which code was where; now, thanks to you, we won't have to remember all that. Through this package a lot of things will become easier, and I think its usage by the community will keep growing.
That's really good to know; glad it helps everyone. It is also what I observed: everyone was using their own recipes. Now you can share, contribute to, and extend a central recipe.
@strivedi02 Does the library have most if not all of the things you use or one would need when dealing with text? I think there is room for a lot more.
Feel free to open issues/pull requests to extend the existing functionality with additional relevant features that are useful for NLP practitioners.
@loopyme I'll be happy to hear your feedback on the work done via this issue; please let me know how I can answer your questions and clarify any doubts.
I have tried to build this library from the ground up, paying attention to cohesive modules and the structure of the library as a whole.
At the moment the library runs slowly and takes a long time to handle large datasets due to the processing required per record. This could be optimised and improved in small steps so it can handle larger datasets.
Opened on the back of discussions in #1. Partially related to #3, although independent of that issue.