twintproject / twint

An advanced Twitter scraping & OSINT tool written in Python that doesn't use Twitter's API, allowing you to scrape a user's followers, following, Tweets and more while evading most API limitations.
MIT License

[Question] Resume Scraping Using Configuration #514

Open pbabvey opened 5 years ago

pbabvey commented 5 years ago

Description of Issue

I am trying to use Resume for a large dataset and I get the error below:

Traceback (most recent call last):
File "/Users/pouria/Dropbox/Programming/Tweet_alanysis/6-month data/code/downloader.py", line 53, in <module>
getTweets("@realDonaldTrump", start_date, end_date, 'w')
File "/Users/pouria/Dropbox/Programming/Tweet_alanysis/6-month data/code/downloader.py", line 36, in getTweets
twint.run.Search(replies)
File "/anaconda3/lib/python3.7/site-packages/twint/run.py", line 292, in Search
run(config, callback)
File "/anaconda3/lib/python3.7/site-packages/twint/run.py", line 213, in run
get_event_loop().run_until_complete(Twint(config).main(callback))
File "/anaconda3/lib/python3.7/site-packages/twint/run.py", line 18, in __init__
self.init = self.get_resume(config.Resume)
File "/anaconda3/lib/python3.7/site-packages/twint/run.py", line 47, in get_resume
if not os.path.exists(resumeFile):
File "/anaconda3/lib/python3.7/genericpath.py", line 19, in exists
os.stat(path)
OverflowError: fd is greater than maximum
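
For context, this OverflowError comes from passing the tweet ID integer straight to os.path.exists, which treats an int argument as an open file descriptor; a 19-digit ID is far beyond the maximum fd. A minimal reproduction, using only the standard library:

import os

# On Python 3.7 this raises OverflowError: fd is greater than maximum,
# exactly as in the traceback above.
os.path.exists(1101630308519235585)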

Here is my code:

import twint

def getTweets(username, begin, end, mode):
    replies = twint.Config()
    replies.To = username
    replies.Since = begin
    replies.Until = end
    replies.Lang = 'en'
    replies.Store_json = True
    replies.Output = '../blahblah'
    replies.Resume = 1101630308519235585
    replies.Debug = True
    replies.Hide_output = True
    twint.run.Search(replies)
start_date = "2019-03-01"
end_date = "2019-03-02"
getTweets("@realDonaldTrump", start_date, end_date, 'w')

It's impossible to work around this by playing with the dates unless we have the option to give a more specific time to the Since field, so we can start from where we left off.

Environment Details

I used PyCharm.

pielco11 commented 5 years ago

replies.Resume needs to be a file name, like replies.Resume = 'resume_file.txt'
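
Applied to the snippet above, the fix would be a single line (the file name here is illustrative; Twint creates the file if it doesn't exist and keeps the latest scroll ID in it):

replies.Resume = 'trump_replies_resume.txt'  # a readable/writable path, not an integer tweet ID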

pbabvey commented 5 years ago

Thank you, I tried it and now it works, but it resumes collecting tweets from the 4296th tweet of the file, while my JSON file contains almost 71,000 tweets. Is there any limit on the number of items in a file?

epanmareza commented 5 years ago

When I set g.Resume = 'filename.csv' and ran twint.run.Search(g), why did I get UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7965: character maps to <undefined>?

pielco11 commented 5 years ago

@pbabvey every request has its own resume ID. For every request, the resume ID is first written to the resume file and then the request is handled, so if something breaks you still have the ID you need to resume.

At every new request, the old ID is deleted and replaced with the new one.
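
A minimal sketch of that pattern (an illustration, not Twint's actual internals; fetch_page is a hypothetical stand-in for the real request):

import os

def fetch_page(scroll_id):
    """Hypothetical request; returns the next scroll ID, or None when done."""
    return None

def run_with_resume(resume_file, first_id):
    scroll_id = first_id
    if os.path.exists(resume_file):
        with open(resume_file) as f:
            scroll_id = f.read().strip()   # resume from the last attempted request
    while scroll_id is not None:
        with open(resume_file, 'w') as f:  # overwrite: old ID replaced by new
            f.write(scroll_id)             # persist the ID *before* the request
        scroll_id = fetch_page(scroll_id)  # so a crash still leaves it on disk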

@epanmareza it seems that a character in your file could not be decoded; that might be an issue on your end

ghost commented 4 years ago

Apologies if this is a dumb question, but is the "resume" file the twint-last-request.log file saved by the debugger? And if not, how do I find or create a resume file?

Edit: Ah, I think I figured it out - you specify the filename in c.Resume.

pielco11 commented 4 years ago

@jomorrcode

import twint

c = twint.Config()
c.Username = "target"
c.Limit = 20                    # cap the number of tweets per run
c.Resume = "target_resume.raw"  # Twint reads/writes the scroll ID here

twint.run.Search(c)

Now if you run this script twice, you'll resume from where Twint stopped (assuming that there are more tweets to scrape)

mnwato commented 3 years ago

I need to resume from the last downloaded data, but I couldn't manage to do that.

I read that I need to pass the last scroll ID to config.Resume.

I found this in the URL log file: scroll%3AthGAVUV0VFVBaEgLeJ1_GV7yQWgsC79bHLzpglEjUAFQAlABEV3Lp5FYCJehgHREVGQVVMVBUAFQAVARUIFQAA. From it I used this part: thGAVUV0VFVBaEgLeJ1_GV7yQWgsC79bHLzpglEjUAFQAlABEV3Lp5FYCJehgHREVGQVVMVBUAFQAVARUIFQAA

Here is the full code:

import twint

c = twint.Config()
c.Search = "gold"
c.Store_csv = True    # a boolean, not the string "True"
c.Output = "none.csv"
c.Lang = "en"
c.Debug = True

twint.run.Search(c)

After the error occurred I added this line: c.Resume = "thGAVUV0VFVBaEgLeJ1_GV7yQWgsC79bHLzpglEjUAFQAlABEV3Lp5FYCJehgHREVGQVVMVBUAFQAVARUIFQAA", but the program started downloading the whole data set from scratch.

My question is: is the scroll ID I used correct, or does it need to be used in another format?

Thanks in advance.

mnwato commented 3 years ago

I found that Resume needs a file path as input, and Twint updates the last ID stored in that file on every request.
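
For example, the earlier snippet with a resume file (the file name is illustrative):

import twint

c = twint.Config()
c.Search = "gold"
c.Store_csv = True
c.Output = "none.csv"
c.Lang = "en"
c.Debug = True
c.Resume = "gold_resume.txt"  # Twint keeps the latest scroll ID in this file

twint.run.Search(c)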

But I have a suggestion:

Because lots of tweets are sent every second, it would be great if "Since" and "Until" accepted a datetime, not just a date. If this is already available, please tell me.
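
For what it's worth, Twint's configuration docs show Since/Until examples with a full timestamp as well as a bare date, so this may already work depending on your installed version; a sketch:

c.Since = "2019-03-01 12:00:00"  # date plus time; a bare "2019-03-01" also works
c.Until = "2019-03-02 08:30:00"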

Thanks to all of you for this great project.