Open vidyap-xgboost opened 4 years ago
Hey @vidyap-xgboost, my apologies, but I'm not sure I understood your point and your suggestions. Could you please rephrase it a bit and explain what you mean by "I would suggest that a bar plot be returned which takes the top_words as input"?
Just so you know, the visualization.py module is the one that convinces me the least. It's very opinionated and not extremely useful; for instance, the function top_words might not even be necessary, as this is simple: hero.tokenize(s).explode().value_counts().
Also, I noticed that you are very good at writing and interested in helping with the documentation. One idea might be to add a blog post on "How to use Pandas on NLP and text mining tasks". Would you be interested in helping write such an article? I have quite a lot of ideas and I can help you formulate them if you would like! 🥳
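To make the one-liner above concrete, here is a minimal sketch in plain Pandas. The toy corpus is made up for illustration, and the lowercase-plus-whitespace split is only a stand-in for hero.tokenize (an assumption: the real tokenizer may handle punctuation and casing differently).

```python
import pandas as pd

# Toy corpus, invented for this sketch.
s = pd.Series([
    "the cat sat on the mat",
    "the dog chased the cat",
])

# Stand-in for hero.tokenize(s): lowercase, then split on whitespace.
# This is an assumption, not Texthero's actual tokenizer.
tokens = s.str.lower().str.split()

# explode() gives one row per token; value_counts() counts each word,
# most frequent first.
word_counts = tokens.explode().value_counts()
print(word_counts.head(3))
```

The result is an ordinary Series indexed by word, so everything Pandas offers on a Series (slicing, filtering, plotting) is available on it.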
What I meant by a bar plot is that the other functions in visualization.py have a scatter plot and a word cloud for visualization, but top_words is a bit odd because it returns only a Pandas Series instead of a bar plot as 'visualization'. So my suggestion was: why not go ahead and add a bar plot for these top_words? Or was the function originally supposed to return only the Series?
And yes, it makes a lot of sense not to even have this function, as you rightly pointed out, but I suppose it was added as a 'feature' under visualization.
--
Thanks for noticing my writing :smile:!
I am really interested in contributing blog articles. May I ask where these articles will be hosted?
I'm also searching for ideas to contribute to examples.
I see! Something useful to keep in mind is that Pandas is super powerful and it already allows for this kind of visualization. For instance, given a corpus, if we want to look at the 10 most common words as a bar plot, we can simply do:
s = pd.Series( ...cool corpus...)
hero.top_words(s)[:10].plot.bar()
That's it! Awesome, isn't it?
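For anyone who wants to try this without installing Texthero, here is a minimal sketch of the same idea in plain Pandas. The toy corpus is made up, and the whitespace split is only a stand-in for what hero.top_words computes (an assumption, not the library's actual implementation); head(10) plays the role of the [:10] slice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs headless
import pandas as pd

# Toy corpus, invented for this sketch.
s = pd.Series([
    "football match ended in a draw",
    "the football season starts in august",
    "tennis final ended in three sets",
    "the tennis season ended in november",
])

# Stand-in for hero.top_words(s): tokenize on whitespace and count words.
counts = s.str.lower().str.split().explode().value_counts()

# Keep the 10 most common words and draw one bar per word.
top = counts.head(10)
ax = top.plot.bar()
ax.set_ylabel("count")
```

Because plot.bar() returns a Matplotlib Axes, the labels and title can be adjusted afterwards, and the figure can be saved with ax.figure.savefig(...).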
The point again is probably that we need to explain to users some useful tricks for dealing with text datasets in Pandas... so what we can really do is create some articles that explain all of these things (I would have loved to have these one year ago, for instance...).
Great that you are interested in contributing blog articles! They will appear here: texthero.org/blog. The way to do that (it will change a bit in the future (#40), but that is not a big deal) is to write the article in Markdown format and then add it under website/blog. See 2017-10-24-texthero-welcome.md as an example.
If you are motivated and want to go further, I'm looking for someone who is willing to support me in managing the whole documentation of Texthero. This includes:
If you want to get more involved and are interested, I can assign you the role of "documentation maintainer"! 📝 ⚡ (It is quite interesting, I would say, as you will have to peer-review all the PRs related to the documentation, organize the documentation, and help other users learn better!)
@jbesomi I'd be more than happy to contribute to the documentation of TextHero in every aspect and to make it easier for others to use and understand this awesome library!
It will be really helpful if you give me some basic pointers as to what a 'documentation maintainer' should be doing and any guidelines they should be following while peer-reviewing PRs.
I can start working on a blog post if you have any suggestions.
Great then! I'm glad to receive your help!
It will be really helpful if you give me some basic pointers as to what a 'documentation maintainer' should be doing and any guidelines they should be following while peer-reviewing PRs.
Unfortunately, I don't have a complete response yet. Some thoughts:
- Text preprocessing page and try to create a good skeleton. The first natural question is: which dataset do we want to use to show how text preprocessing works? The BBCSport one might not be perfect, but it might still work.
- DOCUMENTING.md, the file that explains that 1) we need help with the documentation and 2) how a user should contribute to documenting. Such a file would be similar to CONTRIBUTING.md but fully focused on contributing to the documentation.
- issues labeled documentation
Let me know your opinion! 👍
@jbesomi Thank you for explaining everything!
documentation related issues/PRs.
--
As for #94, the Edit on GitHub button is more of an HTML/CSS task, right?
From the documentation, "visualization" would mean listing the top_words as a Pandas Series grouped by topics, as shown in this example:
So, if I have a dataset without any topics, then I would just get a Pandas Series of top_words, which is not a "visualization". For this, I would suggest that a bar plot be returned which takes the top_words as input.