Simplify Twitter scraper with jsoup

loklak / loklak_server

Distributed Open Source twitter and social media message search server that anonymously collects, shares, dumps and indexes data http://api.loklak.org

GNU Lesser General Public License v2.1


Simplify Twitter scraper with jsoup #1083

Open yukiisbored opened 7 years ago

yukiisbored commented 7 years ago

Currently, TwitterScraper.java is really complex and hard to maintain. This is due to the fact we're doing manual string indexing to extract the important data from the HTML page of Twitter. We can use the jsoup library to simplify this because it provides simple selectors and can manipulate HTML data easily.
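For readers who have not opened the file, here is a minimal sketch of what "manual string indexing" tends to look like. It is not the actual code in TwitterScraper.java: the markup, the class name `ManualIndexingSketch`, and the helper `between` are hypothetical and only illustrate the style.

```java
final class ManualIndexingSketch {
    // Hypothetical markup; the real Twitter page is not reproduced here.
    static final String LINE =
        "<div class=\"tweet\" data-screen-name=\"alice\">"
        + "<p class=\"tweet-text\">hello world</p></div>";

    /** Returns the text between the first `start` marker and the next `end` marker, or null. */
    static String between(String line, String start, String end) {
        int from = line.indexOf(start);
        if (from < 0) {
            return null; // start marker not found
        }
        from += start.length();
        int to = line.indexOf(end, from);
        if (to < 0) {
            return null; // end marker not found
        }
        return line.substring(from, to);
    }

    public static void main(String[] args) {
        // Every field needs its own pair of hand-written markers.
        System.out.println(between(LINE, "data-screen-name=\"", "\""));      // prints "alice"
        System.out.println(between(LINE, "<p class=\"tweet-text\">", "</p>")); // prints "hello world"
    }
}
```

With jsoup, a lookup like the second one would typically become a CSS selector query on a parsed document, along the lines of `doc.select("p.tweet-text").text()`, which is the simplification proposed here.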

jigyasa-grover commented 7 years ago

@yukiisbored +1. The other scrapers were indeed implemented using JSoup last summer, and it seems to be a good option!

sudheesh001 commented 7 years ago

@jig08 let's not implement them in JSoup; all of that code needs to be refactored. I will review this patch later in the day.

yukiisbored commented 7 years ago

Well, it's better to do it to increase maintainability. Currently, only a small fraction of us understand TwitterScraper. If this is done, it'll increase Loklak's maintainability. Also, not a lot of code has to be refactored: we can keep the same methods available on TwitterScraper and return the same results, just with more maintainable code behind them.

Orbiter commented 7 years ago

This is a typical "It's already done, but I know how to do this better" task. Maybe jsoup is better, but maybe jsoup is not flexible enough for us to respond to Twitter's changes all the time.

However, let's give it a try. Plug in your code, but please rename/move the current code to a 'LegacyScraper' class. Please also consider maintaining your code in the future when Twitter changes their format again.
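One way the "keep the old code as LegacyScraper" idea could look is sketched below. This is only an assumption about the shape, not loklak's actual class layout: the interface `TweetTextExtractor`, the class `NewScraper`, the switch `choose`, and the markup are all hypothetical, and `NewScraper` uses a regex as a dependency-free stand-in for a jsoup-based implementation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical common contract, so callers do not care which implementation runs. */
interface TweetTextExtractor {
    String extractText(String html);
}

/** Stand-in for the existing index-based code, moved aside as suggested. */
final class LegacyScraper implements TweetTextExtractor {
    private static final String START = "<p class=\"tweet-text\">";
    private static final String END = "</p>";

    @Override
    public String extractText(String html) {
        int from = html.indexOf(START);
        if (from < 0) {
            return null;
        }
        from += START.length();
        int to = html.indexOf(END, from);
        return to < 0 ? null : html.substring(from, to);
    }
}

/** Stand-in for the new implementation (a regex here, not jsoup). */
final class NewScraper implements TweetTextExtractor {
    private static final Pattern TEXT =
        Pattern.compile("<p class=\"tweet-text\">(.*?)</p>");

    @Override
    public String extractText(String html) {
        Matcher m = TEXT.matcher(html);
        return m.find() ? m.group(1) : null;
    }
}

final class ScraperSwitchSketch {
    /** Hypothetical switch: lets the project fall back to the old code quickly. */
    static TweetTextExtractor choose(boolean useLegacy) {
        return useLegacy ? new LegacyScraper() : new NewScraper();
    }

    public static void main(String[] args) {
        String html = "<p class=\"tweet-text\">hello</p>"; // hypothetical markup
        System.out.println(choose(true).extractText(html));  // prints "hello"
        System.out.println(choose(false).extractText(html)); // prints "hello"
    }
}
```

Keeping both implementations behind one contract means the new code can be tried without removing the version that is known to work.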

vibhcool commented 7 years ago

IMHO, tools like JSoup are less flexible than regex. If there is some change in the Twitter HTML, only the regex needs to be modified, but if we use JSoup, we may have to choose between its methods to best suit the new changes in the HTML.

I think code refactoring is needed in:

1) TwitterScraper's methods, which need to be divided into simpler, smaller methods
2) the data input step of the TwitterTweet object, which needs some changes.
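To make the regex argument above concrete, here is a small sketch under stated assumptions: both markup variants, the class name `RegexExtractionSketch`, and the patterns are hypothetical, not taken from Twitter or from TwitterScraper.java. It shows a markup change being absorbed by editing one pattern constant.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexExtractionSketch {
    // Hypothetical markup before and after a format change.
    static final String OLD_LINE = "<p class=\"tweet-text\">hello</p>";
    static final String NEW_LINE = "<p class=\"TweetTextSize\">hello</p>";

    // One pattern per markup version; only this constant changes when the page does.
    static final Pattern OLD_PATTERN =
        Pattern.compile("<p class=\"tweet-text\">(.*?)</p>");
    static final Pattern NEW_PATTERN =
        Pattern.compile("<p class=\"TweetTextSize\">(.*?)</p>");

    /** Returns capture group 1 of the first match, or null if nothing matches. */
    static String firstGroup(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstGroup(OLD_PATTERN, OLD_LINE)); // prints "hello"
        System.out.println(firstGroup(OLD_PATTERN, NEW_LINE)); // prints "null": the old pattern breaks
        System.out.println(firstGroup(NEW_PATTERN, NEW_LINE)); // prints "hello"
    }
}
```

The same kind of one-line change applies to a selector-based parser, where the selector string plays the role of the pattern, so this sketch illustrates the claim rather than settling the comparison.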

yukiisbored commented 7 years ago

@vibhcool isn't that the same? Not to mention that regex has a pretty rough learning curve for everyone, while JSoup is a simple API where you can just switch the methods around and change the parameters to make it fit.

It's like making a shirt with a sewing machine vs. without one. Working without a sewing machine gives you more flexibility, but with one you can work faster and fix problems without racking your brain.

Also, about the code refactoring: that's something the whole loklak code base needs.

vibhcool commented 7 years ago

@yukiisbored, haha, nice example. I am just a newbie, and I am just exploring :)

What I thought is this:

In the case of TwitterScraper, BufferedReader returns one line at a time (maybe because Twitter results are fetched by AJAX, so it's better to fetch the HTML line by line from the connection object), and each line is processed by the code to extract data.

JSoup generally uses DOM modelling, which analyses the HTML and traverses it as a tree in the backend. Yep, it has features like handling broken HTML. But if we store all of the HTML and then process it, the HTML data is very long, about ten thousand lines.

So, given this, regex appears to be the better option.

There is also a long debate on Stack Overflow about why regex is not better, but in this situation it looks like the better option to me. See the first answer -> http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags. It is pretty interesting :)
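The line-by-line approach described above can be sketched as follows. This is a rough illustration, not the TwitterScraper code: the class name `LineByLineSketch`, the pattern, and the sample page are hypothetical, and a `StringReader` stands in for the network connection.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LineByLineSketch {
    // Hypothetical pattern for a hypothetical markup.
    static final Pattern TWEET_TEXT =
        Pattern.compile("<p class=\"tweet-text\">(.*?)</p>");

    /** Reads one line at a time and keeps only the extracted texts, never the whole page. */
    static List<String> extract(BufferedReader reader) throws IOException {
        List<String> texts = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            Matcher m = TWEET_TEXT.matcher(line);
            while (m.find()) {
                texts.add(m.group(1));
            }
        }
        return texts;
    }

    /** Convenience wrapper for in-memory input; a real caller would wrap the connection's stream. */
    static List<String> extractFromString(String page) {
        try (BufferedReader reader = new BufferedReader(new StringReader(page))) {
            return extract(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String page = "<html>\n"
            + "<p class=\"tweet-text\">first</p>\n"
            + "<div>noise</div>\n"
            + "<p class=\"tweet-text\">second</p>\n"
            + "</html>\n";
        System.out.println(extractFromString(page)); // prints "[first, second]"
    }
}
```

The trade-off is visible in the pattern: an element whose opening and closing tags sit on different lines is missed by this loop, whereas a DOM parser that holds the whole page in memory would still find it.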

sudheesh001 commented 7 years ago

JSoup is terrible with respect to performance. At the same time, there's so much modelling you have to do when you're trying to parse websites that have dynamically loading content, like tracking the postages in #344, for example.

Let's stick to regex. A learning curve isn't really a bad thing; a bad library, on the other hand, is worse. If there are better and faster parsers, e.g. parsley, that would make sense.

mariobehling commented 7 years ago

I am seeing a lot of pros and cons here. What should be done about this? Is this a priority now, or should it be followed up later?

sudheesh001 commented 7 years ago

I believe we should follow up on this later if we experience a lot of breaking code after having multiple quality scrapers like the Twitter scraper. As of now, this doesn't seem like a major issue, at least to me.