gwu-libraries / sfm-ui

Social Feed Manager user interface application.
http://gwu-libraries.github.io/sfm-ui
MIT License

Exported file is always empty on Version 3.0.0 #1149

Open fishfree opened 1 year ago

fishfree commented 1 year ago

Actually, I installed SFM UI the Docker way. I installed 3.0.0 to adopt the Twitter 2.0 API, due to the termination of the free Twitter 1.1 API. On the collection page, it shows results: (screenshot)

After exporting, it shows: (screenshot)

However, when I open the exported file, it always has only the headers, no content: (screenshot)

fishfree commented 1 year ago

@dolsysmith Hi Dolsy, sorry to bother you. I found you upgraded the docker-compose.yml to v3.0.0. Could you please have a look at this issue? And could you also please confirm whether your exporter is working as expected? Thank you very much!

I noticed the docker-compose logs -f output:

twitterrestexporter2_1          | 2023-09-20 05:39:49,867: sfmutils.warc_iter --> Iterating over /sfm-collection-set-data/collection_set/5446e7cd861d4a24a759b1f5bdce278e/767b28d78b164cf7b7ca83257fee81d4/2023/09/17/23/5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz
twitterrestexporter2_1          | 2023-09-20 05:39:49,870: sfmutils.warc_iter --> File 5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz. Processed 0 records. Yielded 0 items.
twitterrestexporter2_1          | 2023-09-20 05:39:50,105: sfmutils.warc_iter --> File 5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz. Processed 3 records. Yielded 100 items.
twitterrestexporter2_1          | 2023-09-20 05:39:50,271: sfmutils.warc_iter --> File 5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz. Processed 4 records. Yielded 200 items.
twitterrestexporter2_1          | 2023-09-20 05:39:50,521: sfmutils.warc_iter --> File 5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz. Processed 5 records. Yielded 300 items.
twitterrestexporter2_1          | 2023-09-20 05:39:50,660: sfmutils.warc_iter --> File 5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz. Processed 6 records. Yielded 400 items.
twitterrestexporter2_1          | 2023-09-20 05:39:50,792: sfmutils.warc_iter --> File 5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz. Processed 7 records. Yielded 500 items.
twitterrestexporter2_1          | 2023-09-20 05:39:50,906: sfmutils.warc_iter --> File 5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz. Processed 8 records. Yielded 600 items.
twitterrestexporter2_1          | 2023-09-20 05:39:51,005: sfmutils.warc_iter --> File 5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz. Processed 9 records. Yielded 700 items.
twitterrestexporter2_1          | 2023-09-20 05:39:51,064: sfmutils.warc_iter --> File 5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz. Processed 10 records. Yielded 798 items.
twitterrestexporter2_1          | 2023-09-20 05:39:51,081: sfmutils.warc_iter --> File 5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz. Processed 10 records. Yielded 800 items.
twitterrestexporter2_1          | 2023-09-20 05:39:51,258: sfmutils.warc_iter --> File 5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz. Processed 13 records. Yielded 900 items.
twitterrestexporter2_1          | 2023-09-20 05:39:51,475: sfmutils.warc_iter --> File 5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz. Processed 15 records. Yielded 1000 items.
twitterrestexporter2_1          | 2023-09-20 05:39:52,220: sfmutils.warc_iter --> File 5eba82fe6c67496990a6d669e8506ba0-20230917231415958-00000-qgr0369b.warc.gz. Processed 20 records. Yielded 1448 items.
twitterrestexporter2_1          | 💔 ERROR: 1 Unexpected items in data! 
twitterrestexporter2_1          | Are you sure you specified the correct --input-data-type?
twitterrestexporter2_1          | If the object type is correct, add extra columns with:
twitterrestexporter2_1          | --extra-input-columns "public_metrics.bookmark_count"
twitterrestexporter2_1          | Skipping entire batch of 6982 tweets!
twitterrestexporter2_1          | 2023-09-20 05:39:54,192: twarc --> CSV Unexpected Data: "public_metrics.bookmark_count". Expected 83 columns, got 64. Skipping entire batch of 6982 tweets!
twitterrestexporter2_1          | 2023-09-20 05:39:54,259: __main__ --> DataFrame contains 0 rows.
dolsysmith commented 1 year ago

@fishfree Thanks for posting your docker logs. I am unfortunately unable to test this for myself, because I don't currently have a working Twitter v.2 API key. But I believe the issue stems from the twarc-csv package, which generates the CSV files from the Twitter JSON. Twitter has been rather relentlessly tweaking their API schema, and whenever they add or drop a field from the JSON, twarc-csv needs to be updated.
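
As a first diagnostic, you could check which version of twarc-csv the exporter container currently has (the container name below is taken from your logs):

docker exec twitterrestexporter2_1 pip show twarc-csv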

I'm not sure whether the latest version of twarc-csv will handle this, but you could try the following (a consolidated sketch of the commands appears after the list):

  1. Open a bash session in the exporter container: docker exec -it twitterrestexporter2_1 /bin/bash.
  2. Run pip install --upgrade twarc-csv.
  3. Exit the bash shell and stop, but do not delete, the container: docker stop twitterrestexporter2_1.
  4. Restart the container: docker start twitterrestexporter2_1.
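
A consolidated sketch of those steps (assuming the container name twitterrestexporter2_1 from your logs; docker restart performs the stop and start in one step):

# upgrade twarc-csv inside the running exporter container
docker exec twitterrestexporter2_1 pip install --upgrade twarc-csv
# stop and start the container without deleting it
docker restart twitterrestexporter2_1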

With luck, when the container restarts, it will use the upgraded version of twarc-csv. If that doesn't work, you might try exporting the full JSON of the Tweets from SFM (since the full JSON export does not rely on twarc-csv) and then using the latest version of twarc-csv outside the containers, at the command line, to convert the JSON to CSV. There you can pass twarc-csv the argument suggested by the error in the logs, which should correct for the issue: --extra-input-columns "public_metrics.bookmark_count".
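
For instance, a minimal sketch of that command-line conversion, assuming twarc and twarc-csv are installed locally and that tweets.jsonl is the (hypothetical) file produced by SFM's full JSON export:

# install or upgrade the converter outside the containers
pip install --upgrade twarc twarc-csv
# convert JSON to CSV, declaring the extra column from the error message
twarc2 csv --extra-input-columns "public_metrics.bookmark_count" tweets.jsonl tweets.csv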

Eventually, I should have time to push a new release of SFM with the latest twarc-csv library in the Docker images. But that probably won't be for another month or so.

In the meantime, I hope that helps!

fishfree commented 1 year ago

@dolsysmith Thank you very much for your tip! I rebuilt the image locally with twarc-csv 0.7.2. It works now.
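
For reference, a minimal sketch of that kind of rebuild (the base image name and this Dockerfile are assumptions, not my exact setup):

cat > Dockerfile <<'EOF'
# Base image name is an assumption; use the exporter image your compose file references.
FROM gwul/sfm-twitter-rest-exporter-v2:3.0.0
# Pin the twarc-csv version that handles the new column.
RUN pip install twarc-csv==0.7.2
EOF
docker build -t myown/sfm-twitter-rest-exporter-v2:3.0.0 .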

fishfree commented 12 months ago

@dolsysmith Sorry to bother you again. Now, on a new server, I deployed sfm-docker with the following docker-compose.yml:

version: "2"
services:
    db:
        image: gwul/sfm-ui-db:3.0.0
        environment:
            - POSTGRES_PASSWORD=${SFM_POSTGRES_PASSWORD}
            - TZ
        logging:
            driver: json-file
            options:
                max-size: ${DOCKER_LOG_MAX_SIZE}
                max-file: ${DOCKER_LOG_MAX_FILE}
        volumes_from:
            - data
        restart: always
    mq:
        image: gwul/sfm-rabbitmq:3.0.0
        hostname: mq
        ports:
            # Opens up the ports for RabbitMQ management
            - "${SFM_RABBITMQ_MANAGEMENT_PORT}:15672"
        environment:
            - RABBITMQ_DEFAULT_USER=${SFM_RABBITMQ_USER}
            - RABBITMQ_DEFAULT_PASS=${SFM_RABBITMQ_PASSWORD}
            - TZ
        logging:
            driver: json-file
            options:
                max-size: ${DOCKER_LOG_MAX_SIZE}
                max-file: ${DOCKER_LOG_MAX_FILE}
        volumes_from:
            - data
        restart: always
    # These containers will exit on startup. That's OK.
    data:
        image: gwul/sfm-data:3.0.0
        volumes:
             - ${DATA_VOLUME_MQ}
             - ${DATA_VOLUME_DB}
             - ${DATA_VOLUME_EXPORT}
             - ${DATA_VOLUME_CONTAINERS}
             - ${DATA_VOLUME_COLLECTION_SET}
             # For SFM instances installed on 2.3.0 or earlier
             # - ${DATA_VOLUME_FORMER_COLLECTION_SET}
             # - ${DATA_VOLUME_FORMER_EXPORT}
        environment:
            - TZ
            - SFM_UID
            - SFM_GID
    processingdata:
        image: debian:buster
        command: /bin/true
        volumes:
             - ${PROCESSING_VOLUME}
        environment:
            - TZ
    ui:
        image: gwul/sfm-ui:3.0.0
        ports:
            - "${SFM_PORT}:8080"
        links:
            - db:db
            - mq:mq
        environment:
            - SFM_DEBUG=False
            - SFM_APSCHEDULER_LOG=INFO
            - SFM_UI_LOG=INFO
            # This adds a 5 minute schedule option to speed testing.
            - SFM_FIVE_MINUTE_SCHEDULE=False
            # This adds a 100 item export segment for testing.
            - SFM_HUNDRED_ITEM_SEGMENT=False
            - TZ
            - SFM_SITE_ADMIN_NAME
            - SFM_SITE_ADMIN_EMAIL
            - SFM_SITE_ADMIN_PASSWORD
            - SFM_EMAIL_USER
            - SFM_EMAIL_PASSWORD
            - SFM_EMAIL_FROM
            - SFM_SMTP_HOST
            - SFM_HOST=${SFM_HOSTNAME}:${SFM_PORT}
            - SFM_HOSTNAME
            - SFM_CONTACT_EMAIL
            - TWITTER_CONSUMER_KEY
            - TWITTER_CONSUMER_SECRET
            - WEIBO_API_KEY
            - WEIBO_API_SECRET
            - TUMBLR_CONSUMER_KEY
            - TUMBLR_CONSUMER_SECRET
            - SFM_RABBITMQ_USER
            - SFM_RABBITMQ_PASSWORD
            - SFM_RABBITMQ_HOST
            - SFM_RABBITMQ_PORT
            - SFM_RABBITMQ_MANAGEMENT_PORT
            - SFM_POSTGRES_PASSWORD
            - SFM_POSTGRES_HOST
            - SFM_POSTGRES_PORT
            # To have some test accounts created.
            - LOAD_FIXTURES=False
            - SFM_REQS=release
            - DATA_VOLUME_THRESHOLD_DB
            - DATA_VOLUME_THRESHOLD_MQ
            - DATA_VOLUME_THRESHOLD_EXPORT
            - DATA_VOLUME_THRESHOLD_CONTAINERS
            - DATA_VOLUME_THRESHOLD_COLLECTION_SET
            - PROCESSING_VOLUME_THRESHOLD
            - DATA_SHARED_USED
            - DATA_SHARED_DIR
            - DATA_THRESHOLD_SHARED
            - SFM_UID
            - SFM_GID
            - SFM_INSTITUTION_NAME
            - SFM_INSTITUTION_LINK
            - SFM_ENABLE_COOKIE_CONSENT
            - SFM_COOKIE_CONSENT_HTML
            - SFM_COOKIE_CONSENT_BUTTON_TEXT
            - SFM_ENABLE_GW_FOOTER
            - SFM_MONITOR_QUEUE_HOUR_INTERVAL
            - SFM_SCAN_FREE_SPACE_HOUR_INTERVAL
            - SFM_WEIBO_SEARCH_OPTION
            - SFM_USE_HTTPS
            - SFM_USE_ELB
            - TWITTER_COLLECTION_TYPES
            # For nginx-proxy
            - VIRTUAL_HOST=${SFM_HOSTNAME}
            - VIRTUAL_PORT=${SFM_PORT}
        logging:
            driver: json-file
            options:
                max-size: ${DOCKER_LOG_MAX_SIZE}
                max-file: ${DOCKER_LOG_MAX_FILE}
        volumes_from:
            - data
            - processingdata
#       # Comment out volumes section if SFM data is stored on mounted filesystems and DATA_SHARED_USED is False.
        volumes:
            - "${DATA_SHARED_DIR}:/sfm-data-shared"
        restart: always
    uiconsumer:
        image: gwul/sfm-ui-consumer:3.0.0
        links:
            - db:db
            - mq:mq
            - ui:ui
        environment:
            - SFM_DEBUG=False
            - SFM_APSCHEDULER_LOG=INFO
            - SFM_UI_LOG=INFO
            - TZ
            - SFM_SITE_ADMIN_NAME
            - SFM_SITE_ADMIN_EMAIL
            - SFM_SITE_ADMIN_PASSWORD
            - SFM_EMAIL_USER
            - SFM_EMAIL_PASSWORD
            - SFM_EMAIL_FROM
            - SFM_SMTP_HOST
            - SFM_HOST=${SFM_HOSTNAME}:${SFM_PORT}
            - SFM_RABBITMQ_USER
            - SFM_RABBITMQ_PASSWORD
            - SFM_RABBITMQ_HOST
            - SFM_RABBITMQ_PORT
            - SFM_POSTGRES_PASSWORD
            - SFM_POSTGRES_HOST
            - SFM_POSTGRES_PORT
            - SFM_REQS=release
            - SFM_UID
            - SFM_GID
            - SFM_USE_HTTPS
        volumes_from:
            - data
            - processingdata
        restart: always
# Twitter
    twitterrestharvester:
        image: gwul/sfm-twitter-rest-harvester:3.0.0
        links:
            - mq:mq
        environment:
            - TZ
            - DEBUG=False
            - SFM_RABBITMQ_USER
            - SFM_RABBITMQ_PASSWORD
            - SFM_RABBITMQ_HOST
            - SFM_RABBITMQ_PORT
            - SFM_REQS=release
            - HARVEST_TRIES=${TWITTER_REST_HARVEST_TRIES}
            - SFM_UID
            - SFM_GID
            - PRIORITY_QUEUES=False
        logging:
            driver: json-file
            options:
                max-size: ${DOCKER_LOG_MAX_SIZE}
                max-file: ${DOCKER_LOG_MAX_FILE}
        volumes_from:
            - data
        restart: always
    twitterpriorityrestharvester:
        image: gwul/sfm-twitter-rest-harvester:3.0.0
        links:
            - mq:mq
        environment:
            - TZ
            - DEBUG=False
            - SFM_RABBITMQ_USER
            - SFM_RABBITMQ_PASSWORD
            - SFM_RABBITMQ_HOST
            - SFM_RABBITMQ_PORT
            - SFM_REQS=release
            - HARVEST_TRIES=${TWITTER_REST_HARVEST_TRIES}
            - SFM_UID
            - SFM_GID
            - PRIORITY_QUEUES=True
        logging:
            driver: json-file
            options:
                max-size: ${DOCKER_LOG_MAX_SIZE}
                max-file: ${DOCKER_LOG_MAX_FILE}
        volumes_from:
            - data
        restart: always

    twitterrestexporter2:
        image: myown/sfm-twitter-rest-exporter-v2:3.0.0
        links:
            - mq:mq
            - ui:api
        environment:
            - TZ
            - DEBUG
            - SFM_RABBITMQ_USER
            - SFM_RABBITMQ_PASSWORD
            - SFM_RABBITMQ_HOST
            - SFM_RABBITMQ_PORT
            - SFM_REQS=${TWITTER_REQS}
            - SFM_UID
            - SFM_GID
            - SFM_UPGRADE_REQS=${UPGRADE_REQS}
            - MAX_DATAFRAME_ROWS
        logging:
            driver: json-file
            options:
                max-size: ${DOCKER_LOG_MAX_SIZE}
                max-file: ${DOCKER_LOG_MAX_FILE}
        volumes_from:
            - data
        restart: always

# PROCESSING
    # This container will exit on startup. That's OK.
    processing:
        image: gwul/sfm-processing:master
        links:
            - ui:api
        environment:
            - TZ
        logging:
            driver: json-file
            options:
                max-size: ${DOCKER_LOG_MAX_SIZE}
                max-file: ${DOCKER_LOG_MAX_FILE}
        volumes_from:
            - data:ro
            - processingdata

Among these images, I built myown/sfm-twitter-rest-exporter-v2:3.0.0 myself, according to my post above. When exporting data, however, docker-compose logs -f shows errors like the following:

twitterrestexporter2_1          | 💔 ERROR: 1 Unexpected items in data!
twitterrestexporter2_1          | Are you sure you specified the correct --input-data-type?
twitterrestexporter2_1          | If the object type is correct, add extra columns with:
twitterrestexporter2_1          | --extra-input-columns "author.public_metrics.like_count"
twitterrestexporter2_1          | Skipping entire batch of 10383 tweets!
twitterrestexporter2_1          | 2023-11-14 17:49:48,748: twarc --> CSV Unexpected Data: "author.public_metrics.like_count". Expected 84 columns, got 58. Skipping entire batch of 10383 tweets!
twitterrestexporter2_1          | 2023-11-14 17:49:48,811: __main__ --> DataFrame contains 0 rows.
twitterrestexporter2_1          | 2023-11-14 17:49:48,812: sfmutils.consumer --> Sending message to sfm_exchange with routing_key export.status.twitter2.twitter_user_timeline_2. The body is: {
dolsysmith commented 12 months ago

Hi @fishfree, my guess is that the Twitter data model has changed again, and that twarc-csv needs another update. Since there hasn't been another release since 0.7.2, I would recommend opening an issue on the twarc-csv repo.

It might be possible to modify the SFM twitter-exporter code to check for these errors and respond accordingly; I'll keep this issue open as a reminder to look at this in a future sprint.

Thanks for letting us know.

fishfree commented 12 months ago

@dolsysmith The field is indeed public_metrics.like_count here, rather than author.public_metrics.like_count as in the sfm-docker error log. And it is also public_metrics.like_count in the official Twitter docs. So is it still a problem in our code?

dolsysmith commented 12 months ago

I wouldn't be surprised if Twitter had not updated their official docs. I don't think the API is much of a priority for them right now. So yes, I imagine the API has changed, and that has broken the twarc-csv dataframe_converter code.
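
If you want to double-check what your harvested JSON actually contains, one hedged way (assumes jq is installed and that tweets.jsonl holds v2 API response pages with a .data array; adjust the path if your export is one tweet per line):

# list the distinct metric field names on the tweet objects
jq -r '.data[]? | .public_metrics | keys[]' tweets.jsonl | sort -u

Comparing that list against the official docs would show whether fields like the one in the error message are really present in the data.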

fishfree commented 11 months ago

@dolsysmith Thank you! Is there any alternative way, then, to export the harvested data as CSV files?

dolsysmith commented 11 months ago

@fishfree I would approach it in two steps (a rough sketch follows the list).

  1. Use SFM's command-line tools to extract the Tweet JSON from the downloaded WARC files.
  2. Run twarc-csv on the extracted JSON with the command-line parameter --extra-input-columns, which should allow you to specify the column that is causing the error.
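
A rough sketch of both steps (the processing container name, the warc-iterator script name, and the placeholder paths are all assumptions; check the sfm-processing documentation for the exact tool that matches the Twitter v2 harvester):

# 1. Open a shell in the processing container and extract tweet JSON from the WARCs.
docker exec -it sfm_processing_1 /bin/bash
twitter_rest_warc_iter.py $(find /sfm-collection-set-data/collection_set/<set-id>/<collection-id> -name "*.warc.gz") > tweets.jsonl
# 2. With twarc-csv installed (pip install twarc-csv), convert to CSV,
#    declaring the extra column from the error message.
twarc2 csv --extra-input-columns "author.public_metrics.like_count" tweets.jsonl tweets.csv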