Propose and implement ways to collect more messages

loklak / loklak_server

Distributed Open Source twitter and social media message search server that anonymously collects, shares, dumps and indexes data http://api.loklak.org

GNU Lesser General Public License v2.1


Propose and implement ways to collect more messages #939

Open mariobehling opened 7 years ago

mariobehling commented 7 years ago

Improve loklak harvester: The way messages are currently collected is slow at times. Please propose ways to increase message and tweet collection. Also consider implementing an "aggressive mode" that an admin could switch on and off.

yukiisbored commented 7 years ago

We can start by having multiple "harvesting strategies", basically multiple ways of doing what org.loklak.Harvester does. It's kinda like a CPU scheduler, but not quite. The reason to have multiple harvesting strategies instead of just one is that everyone's machine is different. One guy is probably running this on his main computer, so we don't want Loklak to interfere with his activities, and another guy is running it on a dedicated server and doesn't mind Loklak squeezing all the resources out of it.

mariobehling commented 7 years ago

@yukiisbored I think it is not only about the resources of the server, but also about the number of connections to sources like Twitter. Services might also block servers that make a lot of requests, but if many requests come from cloud servers that share IPs, e.g. Bluemix, that should not be a problem. So, maybe defining different modes like home mode, laptop mode, server mode and cloud mode could be an approach. What do you think?
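To make the idea of modes concrete, here is a minimal sketch of how the four modes and an admin-switchable "aggressive mode" might be modelled. Everything in it is an assumption for illustration: `HarvestMode`, `HarvestSettings`, the connection counts and the pause values are hypothetical and are not part of the existing loklak code.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical settings holder; not an existing loklak class.
class HarvestSettings {
    private volatile HarvestMode mode = HarvestMode.HOME;
    // Flag an admin could flip at runtime (the "aggressive mode" from the issue text).
    private final AtomicBoolean aggressive = new AtomicBoolean(false);

    void setMode(HarvestMode mode) { this.mode = mode; }
    void setAggressive(boolean on) { aggressive.set(on); }
    boolean isAggressive() { return aggressive.get(); }

    // Assumption: aggressive mode simply halves the pause between requests.
    long effectivePauseMillis() {
        long pause = mode.pauseMillis;
        return aggressive.get() ? pause / 2 : pause;
    }

    public static void main(String[] args) {
        HarvestSettings settings = new HarvestSettings();
        settings.setMode(HarvestMode.SERVER);
        settings.setAggressive(true);
        System.out.println("pause between requests (ms): " + settings.effectivePauseMillis());
    }
}

// Modes named after the suggestion above; all numbers are placeholders.
enum HarvestMode {
    HOME(1, 5_000),
    LAPTOP(1, 10_000),
    SERVER(4, 1_000),
    CLOUD(8, 250);

    final int maxConnections; // parallel connections to one source
    final long pauseMillis;   // pause between two requests on one connection

    HarvestMode(int maxConnections, long pauseMillis) {
        this.maxConnections = maxConnections;
        this.pauseMillis = pauseMillis;
    }
}
```

Keeping the mode and the aggressive flag separate means an admin can switch aggressive harvesting on or off without changing the mode chosen for the machine.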

smokingwheels commented 7 years ago

In the past, if I wanted the Harvester to collect more information, I would recalculate the DoS settings by dividing them by the number of FrontEnds you are running.

Settings to prevent DoS, set as per the API safe limits of Twitter for a FrontEnd:

DoS.blackout = 100
DoS.servicereduction = 1000

So if you had 100 FrontEnds, your BackEnd settings would need to be in the order of:

DoS.blackout = 0
DoS.servicereduction = 10 to 0??
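The recalculation described above can be sketched as a plain integer division. This is only an illustration of the arithmetic: the class `DosScaling` and its method are hypothetical, and nothing here claims to reflect how loklak itself reads or applies `DoS.blackout` and `DoS.servicereduction`.

```java
// Hypothetical helper; only the two setting names come from the thread.
class DosScaling {
    // Divides a per-FrontEnd value by the number of FrontEnds (integer division).
    static int scale(int perFrontEndValue, int frontEnds) {
        if (frontEnds < 1) {
            throw new IllegalArgumentException("frontEnds must be >= 1");
        }
        return perFrontEndValue / frontEnds;
    }

    public static void main(String[] args) {
        int frontEnds = 100;
        System.out.println("DoS.blackout = " + scale(100, frontEnds));
        System.out.println("DoS.servicereduction = " + scale(1000, frontEnds));
    }
}
```

With 100 FrontEnds the division gives 1 and 10, which is in the same range as the values quoted above.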

A problem arises when your BackEnd pushes the info to loklak.org: it will give you lots of 503 errors, and if you forget to turn it off it could also get you banned. Cheers @Orbiter. I will be more careful with my settings from now on.

I guess this is by design at this stage of the project; I have no problem with that.
Also, it's optional: if you ask all your FrontEnds to do the work instead of your BackEnd, it reduces the load no end.

Any Comments welcome.

smokingwheels commented 7 years ago

After testing lots of HTTP transfers today, I found there is bandwidth throttling going on from my ISP/internet connection. There are ways around it, but I do not wish to make them public. Early testing I performed suggested that the FrontEnds collect more data in total than the BackEnd when running close to the limits. Not sure if this is any help?

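```java
import java.io.*;
import java.net.*;
import java.nio.charset.StandardCharsets;

// Minimal wget-style fetcher: prints the body of the URL given as the first argument.
public class wget {
  public static void main(String[] args) throws Exception {
    // try-with-resources closes the stream even if reading fails
    try (BufferedReader r = new BufferedReader(new InputStreamReader(
        URI.create(args[0]).toURL().openStream(), StandardCharsets.UTF_8))) {
      String s;
      while ((s = r.readLine()) != null) {
        System.out.println(s);
      }
    }
  }
}
```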

If my maths are correct, 100 FrontEnds could generate 3 TB of traffic a day while harvesting 150 million tweets.
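As a rough cross-check of those figures, the sketch below breaks the totals down per tweet and per FrontEnd. It assumes decimal units (1 TB = 10^12 bytes) and an even split across FrontEnds; the class `TrafficEstimate` is hypothetical and only the three input numbers come from the estimate above.

```java
// Hypothetical back-of-the-envelope calculator for the figures quoted above.
class TrafficEstimate {
    static final long TOTAL_BYTES_PER_DAY = 3_000_000_000_000L; // 3 TB, assuming 1 TB = 10^12 bytes
    static final long TOTAL_TWEETS_PER_DAY = 150_000_000L;      // 150 million tweets
    static final int FRONT_ENDS = 100;

    static long bytesPerTweet() {
        return TOTAL_BYTES_PER_DAY / TOTAL_TWEETS_PER_DAY;
    }

    static long tweetsPerFrontEndPerDay() {
        return TOTAL_TWEETS_PER_DAY / FRONT_ENDS;
    }

    static long bytesPerFrontEndPerDay() {
        return TOTAL_BYTES_PER_DAY / FRONT_ENDS;
    }

    static double tweetsPerFrontEndPerSecond() {
        return tweetsPerFrontEndPerDay() / 86_400.0; // 86,400 seconds in a day
    }

    public static void main(String[] args) {
        System.out.println("bytes of traffic per tweet: " + bytesPerTweet());
        System.out.println("tweets per FrontEnd per day: " + tweetsPerFrontEndPerDay());
        System.out.println("bytes per FrontEnd per day: " + bytesPerFrontEndPerDay());
        System.out.printf("tweets per FrontEnd per second: %.2f%n", tweetsPerFrontEndPerSecond());
    }
}
```

Under these assumptions the estimate works out to about 20 kB of traffic per harvested tweet, and roughly 30 GB and 1.5 million tweets per FrontEnd per day.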