pournaki / twitter-explorer

A framework for observing Twitter through interactive networks.
GNU General Public License v3.0
69 stars · 26 forks

Error using twitwi format #11

Open lakonis opened 1 year ago

lakonis commented 1 year ago

Hello, I have several errors using CSVs produced by the minet twitter CLI command.

I followed these instructions to remove the additional fields used by the scrape subcommand, so the CSV should now be compliant with the twitwi format.

Testing with different datasets, I get the following errors:

File "/home/nicolas/.local/lib/python3.10/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 565, in _run_script
    exec(code, module.__dict__)
File "/home/nicolas/.local/lib/python3.10/site-packages/twitterexplorer/apps/visualizer.py", line 93, in <module>
    langbars = plot_tweetlanguages(df)
File "/home/nicolas/.local/lib/python3.10/site-packages/twitterexplorer/plotting.py", line 87, in plot_tweetlanguages
    langcounts['language'] = langcounts['language_code'].apply(lambda x: iso_to_language[x])
File "/usr/lib/python3.10/site-packages/pandas/core/series.py", line 4771, in apply
    return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
File "/usr/lib/python3.10/site-packages/pandas/core/apply.py", line 1123, in apply
    return self.apply_standard()
File "/usr/lib/python3.10/site-packages/pandas/core/apply.py", line 1174, in apply_standard
    mapped = lib.map_infer(
File "pandas/_libs/lib.pyx", line 2924, in pandas._libs.lib.map_infer
File "/home/nicolas/.local/lib/python3.10/site-packages/twitterexplorer/plotting.py", line 87, in <lambda>
    langcounts['language'] = langcounts['language_code'].apply(lambda x: iso_to_language[x])

File "/home/nicolas/.local/lib/python3.10/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 565, in _run_script
    exec(code, module.__dict__)
File "/home/nicolas/.local/lib/python3.10/site-packages/twitterexplorer/apps/visualizer.py", line 81, in <module>
    df = load_data(filename)
File "/home/nicolas/.local/lib/python3.10/site-packages/streamlit/runtime/legacy_caching/caching.py", line 715, in wrapped_func
    return get_or_create_cached_value()
File "/home/nicolas/.local/lib/python3.10/site-packages/streamlit/runtime/legacy_caching/caching.py", line 696, in get_or_create_cached_value
    return_value = non_optional_func(*args, **kwargs)
File "/home/nicolas/.local/lib/python3.10/site-packages/twitterexplorer/apps/visualizer.py", line 71, in load_data
    df = pd.read_csv(path,
File "/usr/lib/python3.10/site-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
File "/usr/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
File "/usr/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 950, in read_csv
    return _read(filepath_or_buffer, kwds)
File "/usr/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 611, in _read
    return parser.read(nrows)
File "/usr/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1778, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
File "/usr/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 235, in read
    data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 790, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 890, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 1037, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas/_libs/parsers.pyx", line 1130, in pandas._libs.parsers.TextReader._convert_tokens

Any idea how to debug twitwi input files?

Thanks !

pournaki commented 1 year ago

Hello,

I tried reproducing the error by installing the latest version of minet and scraping some tweets, but I don't encounter the same errors.

Are you using the latest versions of minet and twitter-explorer?

Note that additional fields in the csv should not change the behaviour of the twitter-explorer, as long as the following columns exist: https://github.com/pournaki/twitter-explorer/blob/master/twitterexplorer/constants.py
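A quick way to check a file against that list is to read only the header row and diff it. A minimal sketch with made-up sample data; the set of names below is only an illustrative subset, the authoritative list lives in constants.py:

```python
# Check whether a CSV has the columns twitter-explorer expects.
# The set below is an illustrative subset; see twitterexplorer/constants.py
# for the authoritative list.
import io
import pandas as pd

required = {"id", "timestamp_utc", "user_screen_name", "lang"}
sample = "id,timestamp_utc,user_screen_name\n1,1680000000,alice\n"  # made-up data

header = pd.read_csv(io.StringIO(sample), nrows=0)  # read only the header row
missing = required - set(header.columns)
print(sorted(missing))  # ['lang']
```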

lakonis commented 1 year ago

That's good news if you encounter no errors!

I use minet 0.67.1 and twitterexplorer 0.6.7. Correct?

In case it helps: I realized that the last line of the second error message above was not included in my copy & paste. Here it is: ValueError: invalid literal for int() with base 10: 'https://pbs.twimg.com/profile_images/606208123997519872/tISV9nnC_normal.jpg'. It suggests that an integer was expected instead of a URL?
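For reference, pandas raises this exact error when a column it is told to read as an integer contains a non-numeric string, e.g. when values end up shifted into the wrong column. A minimal reproduction with made-up data:

```python
# Minimal reproduction: declaring an integer dtype for a column that
# actually contains a URL string triggers the same ValueError (data is made up).
import io
import pandas as pd

csv = "id,user_followers\n123,https://pbs.twimg.com/profile_images/example.jpg\n"
try:
    pd.read_csv(io.StringIO(csv), dtype={"user_followers": int})
except ValueError as e:
    print(e)  # invalid literal for int() with base 10: 'https://...'
```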

What else should I explore? I also get a warning message about caching, but I am only using very small datasets at the moment, and I should not go beyond 2-3K tweets anyway.

st.cache is deprecated. Please use one of Streamlit's new caching commands, st.cache_data or st.cache_resource.

pournaki commented 1 year ago

I use minet 0.67.1 and twitterexplorer 0.6.7. Correct ?

yes

Did you do anything to your CSV between collecting it with minet and dropping it into ~/twitterexplorer/data/? For example, concatenating files?

If possible, please send me an example file that generates the error via mail pournaki[at]mis.mpg.de

lakonis commented 1 year ago

I figured out that using the minet twitter scrape subcommand with a CSV query file adds 2 columns to the CSV output: id and query. Therefore, it ends up with 2 id columns.

xsv headers tweets.csv                                                                                           
1   id
2   query
3   id
4   timestamp_utc
5   local_time
...
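This matters because pandas deduplicates repeated headers on load, renaming the second id to id.1. A quick sketch with made-up data:

```python
# pandas renames duplicate CSV headers on load: the second 'id' becomes 'id.1'.
import io
import pandas as pd

csv = "id,query,id,timestamp_utc\n1,cats,111,1680000000\n"  # made-up data
df = pd.read_csv(io.StringIO(csv))
print(list(df.columns))  # ['id', 'query', 'id.1', 'timestamp_utc']
```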

I tried this to remove the 2 unwanted columns (especially the first id):

import pandas as pd
from pathlib import Path

df = pd.read_csv('./tweets.csv')
df1 = df.drop(df.columns[[0, 1]], axis=1)  # drop the extra 'id' and 'query' columns
df1 = df1.rename(columns={'id.1': 'id'})  # pandas refers to the second 'id' as 'id.1'
df1 = df1[['id','timestamp_utc','user_screen_name','lang','to_username','to_userid','to_tweetid','user_id','user_name','user_followers','user_friends','retweeted_id','retweeted_user','retweeted_user_id','quoted_id','quoted_user','quoted_user_id','mentioned_ids','mentioned_names','hashtags','collected_via']]  # keep only the necessary columns

filepath = Path('/home/nicolas/twitterexplorer/data/tweets-twitwi.csv')
filepath.parent.mkdir(parents=True, exist_ok=True)
df1.to_csv(filepath, index=False)  # index=False avoids writing an extra unnamed index column

The output CSV loads into twitter-explorer, although nothing is displayed:

[screenshot: empty visualizer view]

and if I try to generate a graph, it gives:

ValueError: max() arg is an empty sequence
Traceback:

File "/home/nicolas/.local/lib/python3.10/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 565, in _run_script
    exec(code, module.__dict__)
File "/home/nicolas/.local/lib/python3.10/site-packages/twitterexplorer/apps/visualizer.py", line 208, in <module>
    G.reduce_network(giant_component=True,
File "/home/nicolas/.local/lib/python3.10/site-packages/twitterexplorer/networks.py", line 147, in reduce_network
    G = G.components(mode="weak").giant()
File "/home/nicolas/.local/lib/python3.10/site-packages/igraph/clustering.py", line 429, in giant
    max_size = max(ss)
pournaki commented 1 year ago

OK, I think I know what is going on. pandas has problems when a column exists twice, so I need to fix that on my end.

As for the error you get when you generate the graph, this is expected behaviour because you probably tried to generate a retweet network from scraped tweets. Scraping will not return retweets, so you can build all the networks except for retweet networks :) I should adjust the error message to make that clearer!
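If you want to confirm this on a file before building the network, counting the non-empty retweeted_id values is enough. A hypothetical snippet with made-up data; scraped datasets will show zero:

```python
# Count potential retweet edges: scraped tweets have no 'retweeted_id' values,
# so a retweet network built from them is empty (data here is made up).
import io
import pandas as pd

sample = "id,retweeted_id\n1,\n2,\n3,\n"
df = pd.read_csv(io.StringIO(sample))
print(df["retweeted_id"].notna().sum())  # 0
```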

lakonis commented 1 year ago

Indeed! Sorry, I stopped at the default retweet network. It works with the previous workaround (removing the extra id column).

For your information, I first based my column list on twitwi_schema from constants.py, but it is missing the 'collected_via' field, which is present in cols_to_load.