lakonis opened this issue 1 year ago
Hello,

I tried reproducing the error by installing the latest version of minet and scraping some tweets, but I don't encounter the same errors. Are you using the latest versions of `minet` and `twitter-explorer`?

Note that additional fields in the csv should not change the behaviour of the twitter-explorer, as long as the following columns exist: https://github.com/pournaki/twitter-explorer/blob/master/twitterexplorer/constants.py
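A quick way to check a file against a required column list could look like the sketch below. The `REQUIRED` set is an illustrative subset only; the authoritative list lives in `twitterexplorer/constants.py`.

```python
import pandas as pd

# Illustrative subset of required columns (assumption -- see constants.py
# in the twitter-explorer repo for the authoritative list).
REQUIRED = {"id", "timestamp_utc", "user_screen_name", "user_id"}

# Toy frame standing in for a loaded csv
df = pd.DataFrame(columns=["id", "query", "timestamp_utc", "user_screen_name"])

missing = REQUIRED - set(df.columns)
print(sorted(missing))  # ['user_id']
```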
That's good news if you encounter no error! I use minet 0.67.1 and twitterexplorer 0.6.7. Correct?

If that helps: I realize that the last line of the second error message above was not included in my copy & paste. Here it is:

`ValueError: invalid literal for int() with base 10: 'https://pbs.twimg.com/profile_images/606208123997519872/tISV9nnC_normal.jpg'`

It suggests that an integer was expected instead of a URL?

What else should I explore? I also have a warning message about caching, but I am only using very small datasets at the moment, and I should not go beyond 2-3K tweets anyway:
> `st.cache` is deprecated. Please use one of Streamlit's new caching commands, `st.cache_data` or `st.cache_resource`.
> I use minet 0.67.1 and twitterexplorer 0.6.7. Correct?

Yes.
Did you do anything to your csv between collecting it with minet and dropping it into `~/twitterexplorer/data/`? Like concatenating it or something else?

If possible, please send me an example file that generates the error via mail: pournaki[at]mis.mpg.de
I figured out that using the `minet twitter scrape` subcommand with a csv query file adds 2 columns to the csv output: `id` and `query`. Therefore, it ends up with two `id` columns.
```
$ xsv headers tweets.csv
1   id
2   query
3   id
4   timestamp_utc
5   local_time
...
```
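For what it's worth, pandas de-duplicates repeated headers on read, renaming the second `id` to `id.1`. A small in-memory example (with made-up values) shows this:

```python
import io
import pandas as pd

# A header row with the duplicated 'id' column, as produced by
# `minet twitter scrape` with a csv query file (toy values).
csv_text = "id,query,id,timestamp_utc\n1,cats,111,1680000000\n"
df = pd.read_csv(io.StringIO(csv_text))

# pandas renames the second duplicate header to 'id.1'
print(list(df.columns))  # ['id', 'query', 'id.1', 'timestamp_utc']
```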
I tried this to remove the 2 unwanted columns (especially the first `id`):

```python
import pandas as pd
from pathlib import Path

df = pd.read_csv('./tweets.csv')
df1 = df.drop(df.columns[[0, 1]], axis=1)  # drop the leading 'id' and 'query' columns
df1 = df1.rename(columns={'id.1': 'id'})   # pandas refers to the second 'id' as 'id.1'
# keep only the necessary columns
df1 = df1[['id','timestamp_utc','user_screen_name','lang','to_username','to_userid','to_tweetid','user_id','user_name','user_followers','user_friends','retweeted_id','retweeted_user','retweeted_user_id','quoted_id','quoted_user','quoted_user_id','mentioned_ids','mentioned_names','hashtags','collected_via']]
filepath = Path('/home/nicolas/twitterexplorer/data/tweets-twitwi.csv')
filepath.parent.mkdir(parents=True, exist_ok=True)
df1.to_csv(filepath, index=False)  # index=False avoids writing an extra unnamed column
```
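A sketch of an alternative that removes the duplicates by label rather than by position, relying on the same `id.1` renaming that pandas applies on read:

```python
import io
import pandas as pd

# Same duplicated header as above (toy values); pandas renames the
# second 'id' to 'id.1' on read.
csv_text = "id,query,id,timestamp_utc\nq1,cats,111,1680000000\n"
df = pd.read_csv(io.StringIO(csv_text))

# Drop the query-related columns by label, then restore the tweet id's name
df = df.drop(columns=["id", "query"]).rename(columns={"id.1": "id"})
print(list(df.columns))  # ['id', 'timestamp_utc']
```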
The output csv is loaded into `twitter-explorer`, although nothing is displayed, and if I try to generate a graph, it gives:

`ValueError: max() arg is an empty sequence`
```
Traceback:
File "/home/nicolas/.local/lib/python3.10/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 565, in _run_script
    exec(code, module.__dict__)
File "/home/nicolas/.local/lib/python3.10/site-packages/twitterexplorer/apps/visualizer.py", line 208, in <module>
    G.reduce_network(giant_component=True,
File "/home/nicolas/.local/lib/python3.10/site-packages/twitterexplorer/networks.py", line 147, in reduce_network
    G = G.components(mode="weak").giant()
File "/home/nicolas/.local/lib/python3.10/site-packages/igraph/clustering.py", line 429, in giant
    max_size = max(ss)
```
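The traceback ends in igraph's `Clustering.giant()`, which takes `max()` of the list of component sizes. If the network is empty there are no components, so the list is empty and `max()` raises exactly the error shown. A minimal reproduction in plain Python:

```python
# Component sizes of an empty network: there are none.
sizes = []

# max() on an empty sequence (with no default) raises ValueError,
# which is the error surfaced by the visualizer.
try:
    max(sizes)
    message = None
except ValueError as err:
    message = str(err)

print(message)  # max() arg is an empty sequence
```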
OK, I think I know what is going on. `pandas` has problems when a column exists twice, so I need to fix that on my end.

As for the error you get when you generate the graph, this is expected behaviour, because you probably tried to generate a retweet network from scraped tweets. Scraping will not return retweets, so you can build all the networks except retweet networks :) I should adjust the error message to make that clearer!
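A quick pre-check along these lines (a sketch; `retweeted_id` is the twitwi column name that appears in the workaround above) can tell whether a retweet network is even possible for a given file:

```python
import pandas as pd

# Toy frame mimicking scraped tweets: 'retweeted_id' is empty
# everywhere, because scraping does not return retweets.
df = pd.DataFrame({
    "id": ["1", "2"],
    "retweeted_id": [None, None],
})

has_retweets = bool(df["retweeted_id"].notna().any())
print(has_retweets)  # False -> a retweet network built from this file would be empty
```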
Indeed! Sorry, I stopped at the default retweet network. It works with the previous workaround (removing the extra `id` column).

For your information, I first based my list on `twitwi_schema` from `constants.py`, but it misses the `'collected_via'` field, which is present in `cols_to_load`.
Hello, I have several errors using csv files produced by the `minet twitter` CLI command. I followed those instructions to remove the additional fields added by the `scrape` subcommand, and the csv should now be compliant with the twitwi format.

Testing with different datasets, I get the following errors:

dataset 1: the csv is correctly ingested and the Visualizer presents the correct distribution of tweets over time, but after that it gives

`KeyError: '13'`

and displays no further graphics. Traceback:

`ValueError: invalid literal for int() with base 10: 'https://pbs.twimg.com/profile_images/606208123997519872/tISV9nnC_normal.jpg'`

with traceback:

Any idea how to debug twitwi input files?

Thanks!