rrmerugu-archive / webpage-reader

Reads a webpage and extracts information such as SEO tags, headings, and URLs, based on HTML5 tags and standard styling frameworks.
MIT License

Extract the text of each element, create n-grams, and determine tf-idf weights #7

Open rrmerugu opened 6 years ago

rrmerugu commented 6 years ago

Each element, e.g.:

<h1>Cloud Computing for Drug Discovery</h1>
<p>Drug Discovery is a science of lorem ipusum a 

Each element should be traversed down to its last child element, and the text of each element should be processed to build a pool of n-grams, from which the tf-idf weights are then calculated.
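A minimal sketch of the n-gram and tf-idf step, using only the standard library. It assumes each element's text is treated as one "document", that unigrams and bigrams are pooled together, and that idf is the plain `log(N / df)` form; the function names (`tokenize`, `ngrams`, `tfidf`) and the sample texts are hypothetical and not part of this repo.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(element_texts, n_values=(1, 2)):
    """Weight the n-grams of each element, treating each element as a document.

    tf  = count of the n-gram in the element / total n-grams in the element
    idf = log(number of elements / number of elements containing the n-gram)
    """
    # Build one n-gram counter per element (assumption: all n sizes share a pool).
    docs = []
    for text in element_texts:
        tokens = tokenize(text)
        grams = [g for n in n_values for g in ngrams(tokens, n)]
        docs.append(Counter(grams))

    # Document frequency: in how many elements does each n-gram occur?
    doc_freq = Counter()
    for counts in docs:
        doc_freq.update(counts.keys())

    weights = []
    for counts in docs:
        total = sum(counts.values())
        weights.append({
            gram: (count / total) * math.log(len(docs) / doc_freq[gram])
            for gram, count in counts.items()
        })
    return weights


# Illustrative input only: two short element texts.
sample = ["Cloud Computing for Drug Discovery", "Drug Discovery is a science"]
weights = tfidf(sample)
```

With this idf form, an n-gram that appears in every element (here "drug discovery") gets weight 0, so a smoothed idf may be preferable when a page has only a few elements.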

rrmerugu commented 6 years ago

This can be achieved using the beautifulsoup4 or lxml modules. See https://stackoverflow.com/questions/830997/using-beautiful-soup-how-do-i-iterate-over-all-embedded-text
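A rough sketch of the traversal with beautifulsoup4, assuming the built-in `html.parser` backend and assuming each element should contribute only the text it directly contains, so nested text is not counted again for every ancestor. The helper name `element_texts` and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup, NavigableString

# Illustrative markup only, loosely modelled on the example in this issue.
html = """
<div>
  <h1>Cloud Computing for Drug Discovery</h1>
  <p>Drug Discovery is a <b>science</b></p>
</div>
"""


def element_texts(markup):
    """Return (tag name, text) pairs for every element that directly holds text."""
    soup = BeautifulSoup(markup, "html.parser")
    pairs = []
    # find_all(True) matches every tag, including nested children, in document order.
    for element in soup.find_all(True):
        # Keep only strings that are direct children of this element.
        own = [
            child.strip()
            for child in element.children
            if isinstance(child, NavigableString)
        ]
        text = " ".join(s for s in own if s)
        if text:
            pairs.append((element.name, text))
    return pairs


pairs = element_texts(html)
```

The text of each pair can then be fed into the n-gram and tf-idf step. If the full text of an element including its descendants is wanted instead, `element.get_text()` could be used, at the cost of counting nested text once per ancestor.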