twintproject / twint

An advanced Twitter scraping & OSINT tool written in Python that doesn't use Twitter's API, allowing you to scrape a user's followers, following, Tweets and more while evading most API limitations.
MIT License

Help request: Twint -> formatting and filtering CSV giving error messages #1229

Open Felixkrell opened 3 years ago

Felixkrell commented 3 years ago

Command Ran

import twint
import pandas
import nest_asyncio

nest_asyncio.apply()

c = twint.Config()
c.Search = "EURO2020"
c.Since = "2021-06-19"
c.Until = "2021-06-20"
c.Custom = ['time', 'username', 'tweet', 'likes', 'link']  ## <-- This is the problem line
c.Store_csv = True
c.Output = "Test_EM2020"

twint.run.Search(c)

Description of Issue

I am very new to all of this, let's just say that as a disclaimer. I want to scrape all tweets posted during specific EURO 2020 football games. I tried my best to get everything working, but to be honest I have no idea what I am doing. The pandas and nest_asyncio lines I got from a YouTube channel; it doesn't seem to work if I leave them out, so I kept them.

The general Scraping seems to work fine, but there are several things that I want to do that give me error messages:

  1. When I try to slim the CSV down by only scraping the fields I need (not much, see my c.Custom), I get the message "CRITICAL:root:twint.output:_output:CSV:Error:list indices must be integers or slices, not str". I have no idea what that means, and I am confused because I am not asking for anything new with my chosen parameters: they are in the CSV when I do not filter, so it doesn't make sense to me that I get an error when I ask for fewer attributes.

  2. I really want to keep only English and German tweets, but the c.Lang = "de" line doesn't seem to do anything, and the same goes for English. I also don't know how to combine the two, i.e. filter for only "en" and "de".

  3. Right now I still need to go into Excel and re-format the first column using the separators. That works, but it is an extra step that I am sure can be automated somehow; I just don't know how.

Any help on this would be greatly appreciated. I don't know if this is the right place to ask, but I found no "discussion" forum or something of the sort. Thanks for reading anyways!

Environment Details

Windows 10, Anaconda and Jupyter notebook

batmanscode commented 3 years ago

Hi @Felixkrell

Could you try this and see if it works?

import twint
import nest_asyncio

nest_asyncio.apply()

c = twint.Config()

c.Search = "EURO2020"
c.Since = "2021-06-19"
c.Until = "2021-06-20"

c.Custom["tweet"] = ["timestamp", "username", "likes_count", "link"]
c.Output = "Test_EM2020.csv" # optional, to save as a csv
c.Store_csv = True

twint.run.Search(c)
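On the Excel separator step (point 3 in the issue): you can usually skip Excel entirely and load the CSV with pandas, passing the separator explicitly. A minimal, self-contained sketch — the inline sample and its column names are stand-ins, so check the header of the file twint actually writes (e.g. Test_EM2020.csv) and adjust sep accordingly:

```python
import io
import pandas as pd

# Stand-in for the file twint writes (e.g. "Test_EM2020.csv");
# the column names here are illustrative, not guaranteed.
raw = io.StringIO(
    "timestamp,username,tweet,likes_count,link\n"
    "2021-06-19 21:00:00,alice,Great game! #EURO2020,12,https://example.com/1\n"
    "2021-06-19 21:05:00,bob,Tolles Spiel!,3,https://example.com/2\n"
)

# sep="," (or sep="\t" if the file is tab-separated) replaces the
# manual "Text to Columns" step in Excel.
df = pd.read_csv(raw, sep=",")
print(df.shape)  # (2, 5)
```

With a real file you would pass the filename instead of the StringIO object; encoding="utf-8" can also be set explicitly if special characters come out wrong.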

Where to look in the docs

I started using this recently as well and had some trouble understanding the documentation. I hope this helps! 😊

For the custom line, I got the format and customizable attributes from: https://github.com/twintproject/twint/wiki/Tweet-attributes

Example

[screenshot of the example from the wiki page]

Attributes

[screenshot of the attribute table from the wiki page]

Felixkrell commented 3 years ago

omg thank you so much! It works a lot better now!

I still can't filter out languages with the c.Lang command, and emojis and special characters like ä, ö and ü only display as Unicode escapes. I don't know whether that can even be fixed in twint or Jupyter Notebook, or whether it is an Excel issue.

It is workable now though, so thanks a ton!

batmanscode commented 3 years ago

Glad it works @Felixkrell! 😃

Ah, I am having a similar issue. I wanted to translate all tweets into one language, but it seems the source language has to be specified as well, which won't really work since there will be multiple source languages. See #1234 (hah, the number sequence is satisfying!) for a bit more info.
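Since c.Lang only takes a single value, one workaround is to scrape without a language filter and then keep only the languages you want in pandas afterwards. A toy sketch — the "language" column name is an assumption based on the tweet attributes listed in the wiki, so verify it against your actual CSV header:

```python
import pandas as pd

# Toy stand-in for the scraped tweets; in practice this would come
# from pd.read_csv("Test_EM2020.csv"). "language" is assumed to be
# one of the columns twint wrote.
df = pd.DataFrame({
    "username": ["alice", "bob", "carol"],
    "language": ["en", "fr", "de"],
    "tweet": ["Great game!", "Quel match !", "Tolles Spiel!"],
})

# Keep only English and German rows, since c.Lang cannot
# combine two languages in one run.
en_de = df[df["language"].isin(["en", "de"])]
print(len(en_de))  # 2
```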

Special signs and emoji should display just fine in a notebook, so that is probably an Excel issue. You can filter strings (including emoji, probably) with pandas. For example, links = df[df['message'].str.contains("https://")] keeps only the rows that contain a link; prefix the condition with ~ to drop those rows instead. If you know regex, you can use that with this method as well. See their docs or this tutorial for more info on contains().
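To make the contains() idea above concrete, here is a small self-contained sketch showing both directions — selecting the rows that contain a link, and dropping them with the ~ negation (the "message" column name is just for illustration):

```python
import pandas as pd

# Toy frame standing in for the scraped tweets.
df = pd.DataFrame({
    "message": ["check https://example.com", "no link here", "plain text"],
})

# Rows that contain a link:
links = df[df["message"].str.contains("https://")]

# Rows with the link-containing messages removed
# (note the ~ inverting the boolean mask):
no_links = df[~df["message"].str.contains("https://")]

print(len(links), len(no_links))  # 1 2
```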