jbesomi / texthero

Text preprocessing, representation and visualization from zero to hero.
https://texthero.org
MIT License

The word "visualize" is misleading for hero.top_words #93

Open vidyap-xgboost opened 4 years ago

vidyap-xgboost commented 4 years ago

From the documentation, "visualization" would mean listing the top_words as a Pandas Series grouped by topics, as shown in this example:

Screenshot from 2020-07-15 16-21-28

So, if I have a dataset without any topics, then I would just get a Pandas Series of top_words which is not a "visualization". For this, I would suggest that a bar plot be returned which takes the top_words as input.

visualization.py

jbesomi commented 4 years ago

Hey @vidyap-xgboost, my apologies, but I'm not sure I have understood your point and your suggestions. Could you please rephrase it a bit and explain what you mean by "I would suggest that a bar plot be returned which takes the top_words as input"?

Just so you know, the visualization.py module is the one that convinces me the least. It's very opinionated and not extremely useful; for instance, the function top_words might not even be necessary, as this is simple: hero.tokenize(s).explode().value_counts().
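A rough pandas-only sketch of that one-liner, assuming plain whitespace splitting via `str.split()` as a stand-in for `hero.tokenize` (whose exact tokenization rules are not reproduced here). The sample corpus is made up for illustration:

```python
import pandas as pd

# Made-up corpus for illustration only.
s = pd.Series(["the cat sat", "the dog sat", "the bird flew"])

# Split each document into tokens, flatten to one token per row,
# then count how often each token occurs (most frequent first).
word_counts = s.str.split().explode().value_counts()

print(word_counts.head(2))
```

The result is an ordinary Pandas Series indexed by word, so slicing, filtering, and plotting all work on it directly.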

Also, I noticed that you are very good at writing and interested in helping with the documentation. One idea might be to add a blog post on "How to use Pandas on NLP and text mining tasks". Would you be interested in helping write such an article? I have quite a lot of ideas and I can help you shape it if you would like! 🥳

vidyap-xgboost commented 4 years ago

What I meant by a bar plot is that the other functions in visualization.py produce a scatter plot and a word cloud, but top_words is a bit odd because it returns only a Pandas Series instead of a bar plot as its 'visualization'. So my suggestion was, why not go ahead and add a bar plot for these top_words? Or was the function originally supposed to return only the Series?

And yes, it makes a lot of sense not to even have this function, as you rightly pointed out, but I suppose it was added as a 'feature' under visualization.

--

Thanks for noticing my writing :smile:!

I am really interested in contributing blog articles. May I ask where these articles will be hosted?

I'm also searching for ideas to contribute to examples.

jbesomi commented 4 years ago

I see! Something useful to keep in mind is that Pandas is super powerful and it already allows for this kind of visualization. For instance, given a corpus, if we want to look at the 10 most common words as a bar plot, we can simply do:

import pandas as pd
import texthero as hero

s = pd.Series( ...cool corpus...)
hero.top_words(s)[:10].plot.bar()

That's it! Awesome, isn't it?
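For readers who want to try the same idea without Texthero installed, here is a minimal sketch under stated assumptions: `top_word_counts` and `plot_top_words` are hypothetical helper names, whitespace splitting stands in for `hero.top_words`, and the corpus is invented. Plotting through pandas requires matplotlib to be installed.

```python
import pandas as pd


def top_word_counts(s: pd.Series, n: int = 10) -> pd.Series:
    """Return the n most frequent whitespace-separated tokens in s."""
    return s.str.split().explode().value_counts().head(n)


def plot_top_words(s: pd.Series, n: int = 10):
    """Draw the top-n counts as a bar chart (pandas delegates to matplotlib)."""
    return top_word_counts(s, n).plot.bar()


# Invented corpus for illustration only.
corpus = pd.Series(["to be or not to be", "to do is to be"])
counts = top_word_counts(corpus, n=2)
print(counts)
```

Calling `plot_top_words(corpus)` returns a matplotlib Axes object, so the title, labels, and colors can be adjusted afterwards with the usual matplotlib methods.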

The point, again, is probably that we need to explain to users some useful tricks for dealing with text datasets in Pandas... so what we can really do is write some articles that explain all of these things (I would have loved to have them a year ago, for instance...).

Great that you are interested in contributing blog articles! They will appear here: texthero.org/blog. To add one, write the article in markdown format and put it under website/blog; see 2017-10-24-texthero-welcome.md as an example. (This will change a bit in the future, see #40, but that is no big trouble.)

If you are motivated and want to go further, I'm looking for someone who is willing to help me manage the whole documentation of Texthero. This includes:

  1. Getting-started guides
  2. Blog articles under /blog
  3. All extra markdown files (README, CONTRIBUTING.md, PURPOSE.md, _newfiles.md)

If you are interested and want to get more involved, I can assign you the role of "documentation maintainer"! 📝 ⚡ (It's quite interesting, I would say, as you will peer-review all the PRs related to the documentation, organize the documentation, and help other users learn better!)

vidyap-xgboost commented 4 years ago

@jbesomi I'd be more than happy to contribute to the documentation of TextHero in every aspect and make it easier for others to use and understand this awesome library!

It will be really helpful if you give me some basic pointers as to what a 'documentation maintainer' should be doing and any guidelines they should be following while peer-reviewing PRs.

I can start working on a blog post if you have any suggestions.

jbesomi commented 4 years ago

Great then! I'm glad to receive your help!

> It will be really helpful if you give me some basic pointers as to what a 'documentation maintainer' should be doing and any guidelines they should be following while peer-reviewing PRs.

Unfortunately, I don't have a complete response yet. Some thoughts:

  1. The general idea would be to enhance the documentation in any way. Imagine you are completely new to Texthero and you open the website https://texthero.org. You quickly want to understand what Texthero is about (the homepage's goal), and then you move to "getting-started" to learn how it works. For now, we should focus on the getting-started section. I expect it to have at least four pages:
    1. Getting started (is already there)
    2. Text preprocessing
    3. Text representation
    4. Text visualization
  2. For now, we can concentrate, for instance, on the Text preprocessing page and try to create a good skeleton. The first natural question is: which dataset do we want to use to show how text preprocessing works? The BBCSport one might not be perfect, but it might still work.
  3. For creating the documents, we might want to use Sphinx-Gallery instead of markdown. Look, for instance, at this PyTorch tutorial: TEXT CLASSIFICATION WITH TORCHTEXT. If you look at the source code, you will see that it is just a Python file where the comments contain lightweight markup (reStructuredText, which plays the same role as markdown). If you like this idea, we might want to introduce it as the default approach. One of the advantages of this system is that we don't have to create the images separately. Once you understand how it works, the next step would be to open an issue saying we should work on that. Later on, we will have to explain how everything works to other contributors interested in improving or creating documents.
  4. Related to that, we will probably need to write a DOCUMENTING.md file that explains 1) that we need help with the documentation and 2) how a user should contribute to it. Such a file will be similar to CONTRIBUTING.md but fully focused on contributing to the documentation.
  5. "Guidelines they should be following while peer-reviewing PRs": you probably will learn this as time goes on! We can, for instance, review the first PRs together and have an exchange of opinions
  6. Improve/Complete/Order/Comment/Assign all issues labeled documentation
  7. We will soon move from Docusaurus to Sphinx. Have a look at #40. Having a general idea of what Sphinx does (create the API documentation from the docstrings) is useful for sure.
  8. (Edit) Edit and keep track of #94.
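
To make point 3 more concrete, here is a minimal sketch of what a Sphinx-Gallery source file can look like. The page title, the section names, and the tiny corpus are all hypothetical, and plain pandas string methods stand in for a real Texthero preprocessing pipeline. In Sphinx-Gallery, the module docstring becomes the page title and introduction, and comment blocks that start with `# %%` are rendered as reStructuredText between the code cells:

```python
"""
Text preprocessing
==================

A hypothetical getting-started page written as a Sphinx-Gallery script.
This docstring is rendered as the title and introduction of the page.
"""

# %%
# Load a tiny corpus
# ------------------
# This comment block is rendered as text; the code below it is shown
# as a code cell on the generated page.

import pandas as pd

s = pd.Series(["Texthero is GREAT!", "Pandas makes NLP   simple."])

# %%
# Clean the text
# --------------
# Lowercase, drop punctuation, and collapse repeated whitespace.

clean = (
    s.str.lower()
    .str.replace(r"[^\w\s]", "", regex=True)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)
print(clean.tolist())
```

Because the file is ordinary Python, it can be run and tested on its own before it is ever built into the documentation.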

Let me know your opinion! 👍

vidyap-xgboost commented 4 years ago

@jbesomi Thank you for explaining everything!

  1. I can start looking for some datasets that could explain preprocessing in a better way rather than using BBCSport. I will update in #94 once I find something and we can exchange our thoughts on it.
  2. I had a glance at the links you provided and I would like to get familiar with these since I'm a noob and need some time to understand them. I prefer to understand them from the ground up so I don't miss any important points (which might take time). I'm hoping more people will join TextHero to help and to make this process easier for beginners to understand.
  3. Once point 3 is done, we can start DOCUMENTING.md
  4. Sure, thanks! I agree.
  5. I'm keeping track of all documentation related issues/PRs.

-- As for #94, the "Edit on GitHub" button is more of an HTML/CSS task, right?