Open · yukiisbored opened this issue 7 years ago
@yukiisbored +1. The other scrapers were indeed implemented using JSoup last summer, and it seems to be a good option!
@jig08 let's not implement them in JSoup; all of that code needs to be refactored. Will review this patch later in the day.
Well, it's better to do it to increase maintainability. Currently, only a small fraction of us understands TwitterScraper; doing this would increase Loklak's maintainability. Also, not a lot of code has to be refactored: we can keep the same methods available on TwitterScraper so they return the same results while being easier to maintain.
This is a typical "it's already ready but I know how to do this better" task. Maybe jsoup is better, but maybe jsoup is not flexible enough for us to respond to Twitter's changes all the time.
However, let's give it a try. Plug in your code, but please rename/move the current code to a 'LegacyScraper' class. Please also consider maintaining your code in the future when Twitter changes their format again.
IMHO, tools like JSoup are less flexible than regex. If Twitter's HTML changes, only the regex needs to be modified, but if we use JSoup we may have to choose among its methods to best suit the new HTML.
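For instance, here is a minimal sketch of the regex approach (the `tweet-text` class and the markup are made up for illustration, not what TwitterScraper actually matches): adapting to a markup change would mean editing only the pattern string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExtractionSketch {
    // Hypothetical pattern; when Twitter's markup changes, only this string is adjusted.
    private static final Pattern TWEET_TEXT = Pattern.compile(
            "<p class=\"tweet-text\"[^>]*>(.*?)</p>", Pattern.DOTALL);

    public static void main(String[] args) {
        String html = "<div><p class=\"tweet-text\" lang=\"en\">hello loklak</p></div>";
        Matcher m = TWEET_TEXT.matcher(html);
        while (m.find()) {
            System.out.println(m.group(1)); // prints: hello loklak
        }
    }
}
```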
I think code refactoring is needed in two places:
1) TwitterScraper's methods need to be divided into simpler blocks of methods, and 2) some changes are needed in the data input step of the TwitterTweet object.
@vibhcool isn't that the same? Not to mention regex has a pretty rough learning curve for everyone, while JSoup is a simple API where you can just switch the methods around and change the parameters to make it fit.
It's like making a shirt with a sewing machine vs. without one. Without a sewing machine you have more flexibility, but with one you can work faster and fix problems without boggling your mind.
Also, about the code refactor: that's something the whole Loklak code base needs.
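For comparison, a tiny sketch of what I mean by switching methods and parameters (the markup and class name here are made up for illustration): several calls get you to the same element, so adapting usually means swapping one call or one selector string.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorSwapSketch {
    public static void main(String[] args) {
        // Hypothetical markup, just to show the shape of the API.
        Document doc = Jsoup.parse("<p class=\"tweet-text\">hello loklak</p>");

        // Equivalent ways to reach the same text; a markup change means
        // changing one of these calls or its parameter, nothing more.
        System.out.println(doc.select("p.tweet-text").text());
        System.out.println(doc.getElementsByClass("tweet-text").text());
        System.out.println(doc.select("p.tweet-text").first().text());
    }
}
```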
@yukiisbored haha, nice example. I'm just a newbie, just exploring :)
What I thought is this:
In the case of TwitterScraper, the BufferedReader returns one line at a time (maybe because Twitter results are fetched via AJAX, so it's better to fetch the HTML line by line from the connection object), and each line is processed by the code to extract data.
JSoup generally uses DOM modelling, which analyses the HTML and traverses it as a tree in the backend. Yes, it has features like handling broken HTML. But if we store all of the HTML and then process it, the HTML data is very long, on the order of ten thousand lines (a rough sketch of the two approaches follows below).
Seeing this, regex appears to be the better option.
There is also a long debate on Stack Overflow about why regex is not the better choice for HTML, but in this situation it looks like the better option to me. See the first answer -> http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags And this is pretty interesting :)
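To make the line-by-line vs. DOM comparison concrete, here is a rough sketch (a plain String stands in for the connection's response, and the tweet-text markup is made up for illustration):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

import org.jsoup.Jsoup;

public class StreamingVsDomSketch {
    public static void main(String[] args) throws IOException {
        String html = "<div class=\"tweet\">\n<p class=\"tweet-text\">hello loklak</p>\n</div>";

        // Current style: consume the response line by line and match as we go,
        // so the whole page never has to be held and modelled in memory.
        try (BufferedReader reader = new BufferedReader(new StringReader(html))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains("tweet-text")) {
                    System.out.println("streaming hit: " + line.trim());
                }
            }
        }

        // jsoup style: the entire document is parsed into a DOM tree first,
        // then queried with selectors.
        System.out.println("dom hit: " + Jsoup.parse(html).select("p.tweet-text").text());
    }
}
```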
JSoup is terrible with respect to performance; at the same time, there's a lot of modelling you have to do if you're trying to parse websites with dynamically loading content, like tracking the postages in #344 for example.
Let's stick to the regex; a learning curve isn't really a bad thing, whereas a bad library is worse. If there are better and faster parsers, e.g. parsley, that would make sense.
I am seeing a lot of pros and cons here. What should be done about this? Is this a priority now, or should it be followed up later?
I believe we should follow up on this later, if we experience a lot of breaking code once we have multiple quality scrapers like the Twitter scraper. As of now this doesn't seem like a major issue, at least to me.
Currently, TwitterScraper.java is really complex and hard to maintain. This is because we're doing manual string indexing to extract the important data from Twitter's HTML pages. We could use the jsoup library to simplify this, because it provides simple selectors and can manipulate HTML data easily.
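As a rough illustration of what a jsoup-based version could look like (the search URL, selectors, and class names below are assumptions for the sketch, not the markup the real TwitterScraper handles), each field becomes one selector call instead of indexOf/substring arithmetic:

```java
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupTwitterScraperSketch {
    public static void main(String[] args) throws IOException {
        // Hypothetical search URL and selectors -- the real TwitterScraper builds its
        // own query URL and would need selectors matched to Twitter's actual markup.
        Document page = Jsoup.connect("https://twitter.com/search?f=tweets&q=loklak").get();

        for (Element tweet : page.select("div.tweet")) {
            String id   = tweet.attr("data-tweet-id");
            String user = tweet.select("span.username").text();
            String text = tweet.select("p.tweet-text").text();
            // In the real refactor these values would populate a TwitterTweet object.
            System.out.println(id + " @" + user + ": " + text);
        }
    }
}
```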