taspinar / twitterscraper

Scrape Twitter for Tweets
MIT License

incredible memory usage #136

Open isaacimholt opened 6 years ago

isaacimholt commented 6 years ago

The program consumes 500-600 megabytes of RAM while executing an unlimited scrape for queries that return many tweets, which is completely unreasonable regardless of the number of tweets. Data should be streamed to file and dropped from memory as soon as possible.

lapp0 commented 6 years ago

I did a bit of work on this in 0.8.0 by introducing a generator, query_tweets_once_generator, that yields tweets as they are fetched.

The next step is to use a multiprocessing queue to get the tweets each process generates iteratively.

Finally, sort the CSV file using https://pypi.org/project/csvsort/ if a --sorted flag is passed.
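
Something like this is what I have in mind for the sorting step (a minimal sketch, assuming the csvsort(input, columns, output_filename=...) entry point shown on its PyPI page, and that the timestamp column of the scraper's output sits at index 3):

from csvsort import csvsort

# Sort the finished CSV on disk rather than in memory; column index 3 is
# assumed to be the timestamp column of the file the scraper wrote.
csvsort('tweets.csv', [3], output_filename='tweets_sorted.csv', has_header=True)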

@taspinar can you give me permission to assign issues so I can keep track of stuff I plan on working on?

isaacimholt commented 6 years ago

I was using this tool through the CLI, which uses query_tweets, but it appears that function wasn't modified.

I tried another Twitter scraping library, but it returned only about 1k results, whereas this library returned 8k. I am waiting for an update to query_tweets, because this library currently seems to be the one that retrieves the most results, thanks to splitting queries across multiple date ranges.

If I may be so forward as to make a suggestion: I am wrapping this library with the following code because the current interface is slightly frustrating; perhaps something similar could be implemented?

import csv
import datetime as dt
from typing import Iterator, NamedTuple, Optional, Sequence

from twitterscraper.query import query_tweets_once_generator

class Tweet(NamedTuple):
    """
    Using NamedTuple to save memory while keeping autocomplete & type annotations.

    --- Comparison ---
    Regular tuple: no attribute access, no type annotations
    Regular class: uses more memory
    Dictionary: uses more memory, no attribute access, no type annotations
    """
    user: str
    fullname: str
    tweet_id: str
    timestamp: dt.datetime
    url: str
    likes: int
    replies: int
    retweets: int
    text: str
    html: str

def get_tweets(query: str,
               limit: Optional[int] = None,
               begin_date: dt.date = dt.date(2006, 3, 21),
               end_date: dt.date = dt.date.today(),
               pool_size: int = 20,
               lang: str = '') -> Iterator[Tweet]:

    # todo: use query_tweets once it returns generator
    # todo: does twitter use utc in query? does dt.date use utc? use pendulum?
    # todo: add oldest_first: bool param to get results starting from newest/oldest

    for t, _ in query_tweets_once_generator(
            query=query, limit=limit, lang=lang):

        yield Tweet(
            user=t.user,
            fullname=t.fullname,
            tweet_id=t.id,
            timestamp=t.timestamp,
            url=t.url,
            likes=t.likes,
            replies=t.replies,
            retweets=t.retweets,
            text=t.text,
            html=t.html,
        )

def save_tweets_csv(tweets: Iterator[Tweet],
                    file_name: str = 'tweets.csv',
                    header_row: Sequence[str] = Tweet._fields) -> None:

    with open(file_name, 'w', newline='') as csv_file:
        # newline='' prevents the csv module from writing extra blank lines on Windows
        writer = csv.writer(csv_file)
        writer.writerow(header_row)
        writer.writerows(tweets)

def save_tweets_json() -> None:
    pass

hashtag = '#worldcup'
tweets = get_tweets(hashtag)
save_tweets_csv(tweets)

edit: I think this project targets Python 2? If that's the case, then using a NamedTuple won't work, so I will probably fork it targeting 3.6+ and play around with some newer features for my use case. If I find any interesting bugs, I will let you know.

isaacimholt commented 6 years ago

@lapp0 @taspinar I am trying to produce some results for work, so I have been researching the query_tweets function in order to rewrite it and get something working quickly and temporarily. I'm not experienced with multiprocessing, but I have been doing some research; hopefully you can confirm or refute some of my conclusions:

  1. The primary culprit of memory usage is that all results are returned in a list; this should be a generator instead. pool.imap_unordered already returns an iterator that yields results progressively, so the simplest fix is to yield from new_tweets (or an equivalent such as for t in new_tweets: yield t). See the first sketch after this list.

  2. I assume pool.imap_unordered is used instead of pool.imap for performance reasons, but it also mixes up the results, meaning tweets are returned out of order. I understand that getting all tweets is often not possible, but I would like to see a best-effort attempt to return tweets in the order they appear on twitter.com, so I would recommend pool.imap. 2.1 addendum: date queries should probably be reversed to start from the most recent and move toward the oldest; currently date ranges begin from the oldest.

  3. I have been doing a lot of reading about Python's multiprocessing pool, and it really is interesting. However, I cannot find any source for the claim that # the number of pools should not exceed the number of [jobs], so I would suggest removing the code that performs that check.

  4. The size of the pool equals the number of parallel workers for jobs in the pool. Broadly speaking, every source I've seen recommends setting this value to the number of cores on your system, or perhaps one less. However, those discussions are always about CPU-intensive tasks; here the processes are primarily I/O bound, waiting for HTTP responses, so a higher number is probably warranted. I don't really have a suggestion, I'm just documenting my research ;). Effectively, the pool size translates directly into a kind of "max concurrent connections" to the target server, which may be a better way to describe its effect.

  5. There is a potential logical problem in how tweets are retrieved in conjunction with the limit parameter. query_tweets splits a query into a series of date ranges (for some reason this partitioning is a function of the number of workers in the pool, which I don't understand). The limit parameter is then divided by the number of workers in the pool (again, the relevance of the pool size is not clear to me; workers that are released should simply pick up a new job). Effectively, each date-range query to Twitter returns at most limit/pool_size results, so we get the first n results from each sub-range within the date range passed to the function, a kind of "skimming" of tweets, which is almost certainly not the expected behavior. The expected behavior is to get back the first limit tweets within the date range provided to the function. I recommend, broadly speaking, sending the full limit parameter to each query_tweets_once call in the pool. Because of the recent work making query_tweets_once a generator, subsequent pages are not fetched until the generator is consumed: each job fetches one page of results and then waits. The first job to be consumed iterates until its internal limit is reached or it is exhausted; the "external" limit is decremented with each result yielded, and the next job begins if that limit has not been reached. Once the external limit reaches 0, the pool is closed. (The second sketch after this list illustrates the counting logic.)

  6. edit: per http://chriskiehl.com/article/parallelism-in-one-line/, a thread pool can be used instead of a process pool; it should be faster and use less memory, and the interface is the same, so only one line should need changing (also shown in the first sketch below).
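
To make points 1, 2 and 6 concrete, here is a rough sketch of what I have in mind. It assumes query_tweets_once keeps roughly its current (query, limit=..., lang=...) signature, with the date operators baked into the query string the way query_tweets builds them today, so treat it as an illustration rather than a drop-in patch:

import datetime as dt
from functools import partial
from multiprocessing.pool import ThreadPool  # point 6: threads, same Pool interface

from twitterscraper.query import query_tweets_once  # assumed signature, see note above


def query_tweets_streaming(query, begin_date, end_date,
                           limit=None, pool_size=20, lang=''):
    """Yield tweets as the workers fetch them instead of building one big list (point 1)."""
    # One sub-query per chunk of the date range, newest first (point 2.1),
    # with the dates baked into the query string as since:/until: operators.
    step = max((end_date - begin_date).days // pool_size, 1)
    sub_queries = []
    until = end_date
    while until > begin_date:
        since = max(until - dt.timedelta(days=step), begin_date)
        sub_queries.append('{} since:{} until:{}'.format(query, since, until))
        until = since

    pool = ThreadPool(pool_size)
    try:
        # imap (not imap_unordered) keeps results in newest-to-oldest job order (point 2).
        for new_tweets in pool.imap(partial(query_tweets_once, limit=limit, lang=lang),
                                    sub_queries):
            yield from new_tweets  # stream instead of accumulating
    finally:
        pool.close()
        pool.join()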

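And the "external limit" idea from point 5 as a sketch; limited is a hypothetical helper that only shows the counting logic (itertools.islice would do the same job) and would wrap whatever streaming function the pool feeds:

def limited(tweets, limit=None):
    """Apply one overall limit to an already-streaming iterator of tweets (point 5)."""
    remaining = limit
    for tweet in tweets:
        if remaining is not None and remaining <= 0:
            break  # external limit reached: stop pulling, so later pages are never fetched
        yield tweet
        if remaining is not None:
            remaining -= 1

# e.g. the first 1000 tweets overall, even though every job was given the full limit:
# for tweet in limited(query_tweets_streaming(...), limit=1000): ...
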
lapp0 commented 6 years ago

date queries should probably be reversed to start from most recent to oldest, currently date ranges begin from oldest.

agreed

Your solution may work, but I think it would be much simpler to just use multiprocessing.Queue. All the processes can share a queue; each process iterates over query_tweets_once_generator and pushes to the queue, and the main process pops from it.
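
Roughly this shape (just a sketch of the queue hand-off with a hypothetical per-query worker; it assumes the tweet objects are picklable and uses a sentinel per worker so the main process knows when everyone is done):

from multiprocessing import Process, Queue

from twitterscraper.query import query_tweets_once_generator

DONE = None  # sentinel each worker pushes once its generator is exhausted


def worker(sub_query, queue):
    # One process per date-range query: iterate the generator and push each
    # tweet to the shared queue as it arrives.
    for tweet, _pos in query_tweets_once_generator(query=sub_query):
        queue.put(tweet)
    queue.put(DONE)


def scrape(sub_queries):
    queue = Queue()
    processes = [Process(target=worker, args=(q, queue)) for q in sub_queries]
    for p in processes:
        p.start()

    finished = 0
    while finished < len(processes):
        item = queue.get()      # main process pops as the workers push
        if item is DONE:
            finished += 1
        else:
            yield item          # stream straight to the csv writer, nothing accumulates

    for p in processes:
        p.join()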