fishfree opened this issue 1 year ago
@dolsysmith Hi Dolsy, sorry to bother you. I found you upgraded the docker-compose.yml to v3.0.0. Could you please have a look at this issue? And could you also please confirm whether your exporter is working as expected? Thank you very much!
I noticed the docker-compose logs -f output:
twitterrestexporter2_1 | 2023-09-20 05:39:49,867: sfmutils.warc_iter --> Iterating over /sfm-collection-set-data/collection_set/5446e7cd861d4a24a759b1f5bdce278e/767b28d78b164cf7b7ca83257fee81d4/2023/09/17/23/5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz
twitterrestexporter2_1 | 2023-09-20 05:39:49,870: sfmutils.warc_iter --> File 5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz. Processed 0 records. Yielded 0 items.
twitterrestexporter2_1 | 2023-09-20 05:39:50,105: sfmutils.warc_iter --> File 5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz. Processed 3 records. Yielded 100 items.
twitterrestexporter2_1 | 2023-09-20 05:39:50,271: sfmutils.warc_iter --> File 5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz. Processed 4 records. Yielded 200 items.
twitterrestexporter2_1 | 2023-09-20 05:39:50,521: sfmutils.warc_iter --> File 5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz. Processed 5 records. Yielded 300 items.
twitterrestexporter2_1 | 2023-09-20 05:39:50,660: sfmutils.warc_iter --> File 5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz. Processed 6 records. Yielded 400 items.
twitterrestexporter2_1 | 2023-09-20 05:39:50,792: sfmutils.warc_iter --> File 5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz. Processed 7 records. Yielded 500 items.
twitterrestexporter2_1 | 2023-09-20 05:39:50,906: sfmutils.warc_iter --> File 5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz. Processed 8 records. Yielded 600 items.
twitterrestexporter2_1 | 2023-09-20 05:39:51,005: sfmutils.warc_iter --> File 5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz. Processed 9 records. Yielded 700 items.
twitterrestexporter2_1 | 2023-09-20 05:39:51,064: sfmutils.warc_iter --> File 5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz. Processed 10 records. Yielded 798 items.
twitterrestexporter2_1 | 2023-09-20 05:39:51,081: sfmutils.warc_iter --> File 5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz. Processed 10 records. Yielded 800 items.
twitterrestexporter2_1 | 2023-09-20 05:39:51,258: sfmutils.warc_iter --> File 5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz. Processed 13 records. Yielded 900 items.
twitterrestexporter2_1 | 2023-09-20 05:39:51,475: sfmutils.warc_iter --> File 5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz. Processed 15 records. Yielded 1000 items.
twitterrestexporter2_1 | 2023-09-20 05:39:52,220: sfmutils.warc_iter --> File 5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz. Processed 20 records. Yielded 1448 items.
twitterrestexporter2_1 | 💔 ERROR: 1 Unexpected items in data!
twitterrestexporter2_1 | Are you sure you specified the correct --input-data-type?
twitterrestexporter2_1 | If the object type is correct, add extra columns with:
twitterrestexporter2_1 | --extra-input-columns "public_metrics.bookmark_count"
twitterrestexporter2_1 | Skipping entire batch of 6982 tweets!
twitterrestexporter2_1 | 2023-09-20 05:39:54,192: twarc --> CSV Unexpected Data: "public_metrics.bookmark_count". Expected 83 columns, got 64. Skipping entire batch of 6982 tweets!
twitterrestexporter2_1 | 2023-09-20 05:39:54,259: __main__ --> DataFrame contains 0 rows.
@fishfree Thanks for posting your docker logs. I am unfortunately unable to test this for myself, because I don't currently have a working Twitter v.2 API key. But I believe the issue stems from the twarc-csv package, which generates the CSV files from the Twitter JSON. Twitter has been rather relentlessly tweaking their API schema, and whenever they add or drop a field from the JSON, twarc-csv needs to be updated.
I'm not sure whether the latest version of twarc-csv will handle this, but you could try the following:
1. docker exec -it twitterrestexporter2_1 /bin/bash
2. pip install --upgrade twarc-csv
3. Exit the container shell, then (on the host) docker stop twitterrestexporter2_1
4. docker start twitterrestexporter2_1

With luck, when the container restarts, it will use the upgraded version of twarc-csv. If that doesn't work, you might try exporting the full JSON of the Tweets from SFM (since the full JSON export does not rely on twarc-csv) and using the latest version of twarc-csv outside the containers, at the command line, to convert the JSON to CSV. At the command line, you can even pass an argument to twarc-csv, as suggested by the error in the logs, which should correct for the issue: --extra-input-columns "public_metrics.bookmark_count"
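After the restart, a quick sanity check (a minimal sketch; the container name is the one from the logs above) is to ask pip which version the container is now running:

# Print the twarc-csv version installed inside the running exporter container
docker exec twitterrestexporter2_1 pip show twarc-csv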
Eventually, I should have time to push a new release of SFM with the latest twarc-csv library in the Docker images. But that probably won't be for another month or so.
In the meantime, I hope that helps!
@dolsysmith Thank you very much for your tip! I rebuilt the image locally with twarc-csv 0.7.2. It works now.
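For anyone else who needs the same fix, a rebuild along these lines should work (a minimal sketch, not the actual SFM Dockerfile; the base image tag is an assumption — substitute whatever exporter image your docker-compose.yml references):

# Hypothetical Dockerfile: extend the published exporter image and pin twarc-csv
FROM gwul/sfm-twitter-rest-exporter-v2:3.0.0
RUN pip install --no-cache-dir twarc-csv==0.7.2

Then build and tag it so docker-compose picks it up:

docker build -t myown/sfm-twitter-rest-exporter-v2:3.0.0 .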
@dolsysmith Sorry to bother you again. Now, on a new server, I deployed sfm-docker with the following docker-compose.yml:
version: "2"
services:
db:
image: gwul/sfm-ui-db:3.0.0
environment:
- POSTGRES_PASSWORD=${SFM_POSTGRES_PASSWORD}
- TZ
logging:
driver: json-file
options:
max-size: ${DOCKER_LOG_MAX_SIZE}
max-file: ${DOCKER_LOG_MAX_FILE}
volumes_from:
- data
restart: always
mq:
image: gwul/sfm-rabbitmq:3.0.0
hostname: mq
ports:
# Opens up the ports for RabbitMQ management
- "${SFM_RABBITMQ_MANAGEMENT_PORT}:15672"
environment:
- RABBITMQ_DEFAULT_USER=${SFM_RABBITMQ_USER}
- RABBITMQ_DEFAULT_PASS=${SFM_RABBITMQ_PASSWORD}
- TZ
logging:
driver: json-file
options:
max-size: ${DOCKER_LOG_MAX_SIZE}
max-file: ${DOCKER_LOG_MAX_FILE}
volumes_from:
- data
restart: always
# These containers will exit on startup. That's OK.
data:
image: gwul/sfm-data:3.0.0
volumes:
- ${DATA_VOLUME_MQ}
- ${DATA_VOLUME_DB}
- ${DATA_VOLUME_EXPORT}
- ${DATA_VOLUME_CONTAINERS}
- ${DATA_VOLUME_COLLECTION_SET}
# For SFM instances installed on 2.3.0 or earlier
# - ${DATA_VOLUME_FORMER_COLLECTION_SET}
# - ${DATA_VOLUME_FORMER_EXPORT}
environment:
- TZ
- SFM_UID
- SFM_GID
processingdata:
image: debian:buster
command: /bin/true
volumes:
- ${PROCESSING_VOLUME}
environment:
- TZ
ui:
image: gwul/sfm-ui:3.0.0
ports:
- "${SFM_PORT}:8080"
links:
- db:db
- mq:mq
environment:
- SFM_DEBUG=False
- SFM_APSCHEDULER_LOG=INFO
- SFM_UI_LOG=INFO
# This adds a 5 minute schedule option to speed testing.
- SFM_FIVE_MINUTE_SCHEDULE=False
# This adds a 100 item export segment for testing.
- SFM_HUNDRED_ITEM_SEGMENT=False
- TZ
- SFM_SITE_ADMIN_NAME
- SFM_SITE_ADMIN_EMAIL
- SFM_SITE_ADMIN_PASSWORD
- SFM_EMAIL_USER
- SFM_EMAIL_PASSWORD
- SFM_EMAIL_FROM
- SFM_SMTP_HOST
- SFM_HOST=${SFM_HOSTNAME}:${SFM_PORT}
- SFM_HOSTNAME
- SFM_CONTACT_EMAIL
- TWITTER_CONSUMER_KEY
- TWITTER_CONSUMER_SECRET
- WEIBO_API_KEY
- WEIBO_API_SECRET
- TUMBLR_CONSUMER_KEY
- TUMBLR_CONSUMER_SECRET
- SFM_RABBITMQ_USER
- SFM_RABBITMQ_PASSWORD
- SFM_RABBITMQ_HOST
- SFM_RABBITMQ_PORT
- SFM_RABBITMQ_MANAGEMENT_PORT
- SFM_POSTGRES_PASSWORD
- SFM_POSTGRES_HOST
- SFM_POSTGRES_PORT
# To have some test accounts created.
- LOAD_FIXTURES=False
- SFM_REQS=release
- DATA_VOLUME_THRESHOLD_DB
- DATA_VOLUME_THRESHOLD_MQ
- DATA_VOLUME_THRESHOLD_EXPORT
- DATA_VOLUME_THRESHOLD_CONTAINERS
- DATA_VOLUME_THRESHOLD_COLLECTION_SET
- PROCESSING_VOLUME_THRESHOLD
- DATA_SHARED_USED
- DATA_SHARED_DIR
- DATA_THRESHOLD_SHARED
- SFM_UID
- SFM_GID
- SFM_INSTITUTION_NAME
- SFM_INSTITUTION_LINK
- SFM_ENABLE_COOKIE_CONSENT
- SFM_COOKIE_CONSENT_HTML
- SFM_COOKIE_CONSENT_BUTTON_TEXT
- SFM_ENABLE_GW_FOOTER
- SFM_MONITOR_QUEUE_HOUR_INTERVAL
- SFM_SCAN_FREE_SPACE_HOUR_INTERVAL
- SFM_WEIBO_SEARCH_OPTION
- SFM_USE_HTTPS
- SFM_USE_ELB
- TWITTER_COLLECTION_TYPES
# For nginx-proxy
- VIRTUAL_HOST=${SFM_HOSTNAME}
- VIRTUAL_PORT=${SFM_PORT}
logging:
driver: json-file
options:
max-size: ${DOCKER_LOG_MAX_SIZE}
max-file: ${DOCKER_LOG_MAX_FILE}
volumes_from:
- data
- processingdata
# # Comment out volumes section if SFM data is stored on mounted filesystems and DATA_SHARED_USED is False.
volumes:
- "${DATA_SHARED_DIR}:/sfm-data-shared"
restart: always
uiconsumer:
image: gwul/sfm-ui-consumer:3.0.0
links:
- db:db
- mq:mq
- ui:ui
environment:
- SFM_DEBUG=False
- SFM_APSCHEDULER_LOG=INFO
- SFM_UI_LOG=INFO
- TZ
- SFM_SITE_ADMIN_NAME
- SFM_SITE_ADMIN_EMAIL
- SFM_SITE_ADMIN_PASSWORD
- SFM_EMAIL_USER
- SFM_EMAIL_PASSWORD
- SFM_EMAIL_FROM
- SFM_SMTP_HOST
- SFM_HOST=${SFM_HOSTNAME}:${SFM_PORT}
- SFM_RABBITMQ_USER
- SFM_RABBITMQ_PASSWORD
- SFM_RABBITMQ_HOST
- SFM_RABBITMQ_PORT
- SFM_POSTGRES_PASSWORD
- SFM_POSTGRES_HOST
- SFM_POSTGRES_PORT
- SFM_REQS=release
- SFM_UID
- SFM_GID
- SFM_USE_HTTPS
volumes_from:
- data
- processingdata
restart: always
# Twitter
twitterrestharvester:
image: gwul/sfm-twitter-rest-harvester:3.0.0
links:
- mq:mq
environment:
- TZ
- DEBUG=False
- SFM_RABBITMQ_USER
- SFM_RABBITMQ_PASSWORD
- SFM_RABBITMQ_HOST
- SFM_RABBITMQ_PORT
- SFM_REQS=release
- HARVEST_TRIES=${TWITTER_REST_HARVEST_TRIES}
- SFM_UID
- SFM_GID
- PRIORITY_QUEUES=False
logging:
driver: json-file
options:
max-size: ${DOCKER_LOG_MAX_SIZE}
max-file: ${DOCKER_LOG_MAX_FILE}
volumes_from:
- data
restart: always
twitterpriorityrestharvester:
image: gwul/sfm-twitter-rest-harvester:3.0.0
links:
- mq:mq
environment:
- TZ
- DEBUG=False
- SFM_RABBITMQ_USER
- SFM_RABBITMQ_PASSWORD
- SFM_RABBITMQ_HOST
- SFM_RABBITMQ_PORT
- SFM_REQS=release
- HARVEST_TRIES=${TWITTER_REST_HARVEST_TRIES}
- SFM_UID
- SFM_GID
- PRIORITY_QUEUES=True
logging:
driver: json-file
options:
max-size: ${DOCKER_LOG_MAX_SIZE}
max-file: ${DOCKER_LOG_MAX_FILE}
volumes_from:
- data
restart: always
twitterrestexporter2:
image: myown/sfm-twitter-rest-exporter-v2:3.0.0
links:
- mq:mq
- ui:api
environment:
- TZ
- DEBUG
- SFM_RABBITMQ_USER
- SFM_RABBITMQ_PASSWORD
- SFM_RABBITMQ_HOST
- SFM_RABBITMQ_PORT
- SFM_REQS=${TWITTER_REQS}
- SFM_UID
- SFM_GID
- SFM_UPGRADE_REQS=${UPGRADE_REQS}
- MAX_DATAFRAME_ROWS
logging:
driver: json-file
options:
max-size: ${DOCKER_LOG_MAX_SIZE}
max-file: ${DOCKER_LOG_MAX_FILE}
volumes_from:
- data
restart: always
# PROCESSING
# This container will exit on startup. That's OK.
processing:
image: gwul/sfm-processing:master
links:
- ui:api
environment:
- TZ
logging:
driver: json-file
options:
max-size: ${DOCKER_LOG_MAX_SIZE}
max-file: ${DOCKER_LOG_MAX_FILE}
volumes_from:
- data:ro
- processingdata
Among these images, I built myown/sfm-twitter-rest-exporter-v2:3.0.0 myself, according to my post above. When exporting data, however, docker-compose logs -f shows the following errors:
twitterrestexporter2_1 | 💔 ERROR: 1 Unexpected items in data!
twitterrestexporter2_1 | Are you sure you specified the correct --input-data-type?
twitterrestexporter2_1 | If the object type is correct, add extra columns with:
twitterrestexporter2_1 | --extra-input-columns "author.public_metrics.like_count"
twitterrestexporter2_1 | Skipping entire batch of 10383 tweets!
twitterrestexporter2_1 | 2023-11-14 17:49:48,748: twarc --> CSV Unexpected Data: "author.public_metrics.like_count". Expected 84 columns, got 58. Skipping entire batch of 10383 tweets!
twitterrestexporter2_1 | 2023-11-14 17:49:48,811: __main__ --> DataFrame contains 0 rows.
twitterrestexporter2_1 | 2023-11-14 17:49:48,812: sfmutils.consumer --> Sending message to sfm_exchange with routing_key export.status.twitter2.twitter_user_timeline_2. The body is: {
Hi @fishfree, my guess is that the Twitter data model has changed again, and that twarc-csv needs another update. Since there hasn't been another release since 0.7.2, I would recommend opening an issue on the twarc-csv repo.
It might be possible to modify the SFM twitter-exporter code to check for these errors and respond accordingly; I'll keep this issue open as a reminder to look at this in a future sprint.
Thanks for letting us know.
@dolsysmith There is indeed public_metrics.like_count here, rather than the author.public_metrics.like_count that appears in the sfm-docker error log. And it is also public_metrics.like_count in the official Twitter docs. So is it still a problem in our code?
I wouldn't be surprised if Twitter had not updated their official docs. I don't think the API is much of a priority for them right now. So yes, I imagine the API has changed, and that has broken the twarc-csv dataframe_converter code.
@dolsysmith Thank you! Then is there any alternative way to export harvested data as CSV files?
@fishfree I would approach it in two steps:
1. Export the full JSON of the Tweets from SFM (the full JSON export does not rely on twarc-csv).
2. Convert the JSON to CSV with the latest version of twarc-csv at the command line, using the --extra-input-columns flag, which should allow you to specify the column that is causing the error (see the sketch below).
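Concretely, step 2 might look like this (a sketch; tweets.jsonl and tweets.csv are placeholder file names for the SFM JSON export and the desired output):

# Install/upgrade the converter locally, outside the containers
pip install --upgrade twarc-csv
# Convert, telling twarc-csv to accept the extra column named in the error
twarc2 csv --extra-input-columns "author.public_metrics.like_count" tweets.jsonl tweets.csv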
Actually, I installed SFM UI the Docker way. I installed 3.0.0 to adopt the Twitter v2 API, due to the termination of the free Twitter v1.1 API. On the collection page, it shows results: [screenshot]
After exporting, it shows: [screenshot]
However, when I open the exported file, it always has only the headers, no content: [screenshot]