status-im / airbyte-custom-connector

Repository holding all the Airbyte custom connectors used at Status

[Twitter] Fetch all tweets, not only the last ones #17

Open apentori opened 1 week ago

apentori commented 1 week ago

Description

For the moment, the connector only fetches the information of the most recent tweets. We need to fetch the whole tweet history at least once.

apentori commented 1 week ago

According to the documentation of the endpoint /2/users/:id/tweets, the query should accept the following parameters:

To get the whole tweet history, the connector should have pagination enabled; it will then fetch all the tweets in one execution.
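A minimal sketch of what pagination means for this endpoint, assuming the response carries the next page token in meta.next_token and the request accepts it back as pagination_token. The fetch_page callable is hypothetical: it stands in for whatever performs the HTTP request in the connector, and the canned pages below are made-up data.

```python
from typing import Callable, Dict, Iterator, Optional


def iter_tweets(fetch_page: Callable[[Optional[str]], Dict]) -> Iterator[Dict]:
    """Yield every tweet by following meta.next_token until it is absent.

    fetch_page is a hypothetical callable: it performs one request to
    /2/users/:id/tweets with the given pagination_token (None for the
    first page) and returns the decoded JSON body.
    """
    token: Optional[str] = None
    while True:
        page = fetch_page(token)
        # "data" is missing when a page holds no tweets.
        yield from page.get("data", [])
        token = page.get("meta", {}).get("next_token")
        if not token:
            break


# Canned pages standing in for the API (illustrative data only).
_pages = {
    None: {"data": [{"id": "3"}, {"id": "2"}],
           "meta": {"next_token": "t1", "result_count": 2}},
    "t1": {"data": [{"id": "1"}], "meta": {"result_count": 1}},
}
tweet_ids = [tweet["id"] for tweet in iter_tweets(lambda tok: _pages[tok])]
print(tweet_ids)
```

Passing the request function in keeps the loop independent of the HTTP client, so it can be exercised without spending any API quota.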

This is great, but it will take a long time and might consume the whole API rate limit. The Twitter documentation is not clear on rate limiting: it is either 1500 requests per 15 min, or 15 requests per 15 min with a limit of 10,000 per 30 days for the Basic plan and 450 requests per 15 min with a limit of 1,296,000 per 30 days for the Pro plan (see GET_2_tweets in https://developer.x.com/en/docs/twitter-api/rate-limits#v2-limits-basic and https://developer.x.com/en/docs/twitter-api/rate-limits#v2-limits-pro). The biggest numbers are found in https://developer.x.com/en/docs/twitter-api/tweets/timelines/api-reference/get-users-id-tweets, but they seem to be wrong based on the rate limiting of the current setup. Fetching the whole tweet history every day for every account will certainly make us reach the 30-day API limit, and will probably make the connector run for multiple hours to fetch all the tweets of some accounts.
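Since the documented limits disagree, one option is to trust what the API reports on each response instead. The sketch below assumes the x-rate-limit-remaining and x-rate-limit-reset response headers are present (the latter as a Unix timestamp in seconds); the function name and the sample values are illustrative only.

```python
import time
from typing import Mapping, Optional


def seconds_until_reset(headers: Mapping[str, str],
                        now: Optional[float] = None) -> float:
    """Return how long to wait before the next request.

    Header names are looked up in lowercase; a plain dict is
    case-sensitive, so normalise the keys first if needed.
    """
    remaining = int(headers.get("x-rate-limit-remaining", "1"))
    if remaining > 0:
        return 0.0  # quota left in the current window, no wait needed
    reset_at = float(headers.get("x-rate-limit-reset", "0"))
    current = time.time() if now is None else now
    return max(0.0, reset_at - current)


# Illustrative values: window exhausted, reset 600 seconds from "now".
wait = seconds_until_reset(
    {"x-rate-limit-remaining": "0", "x-rate-limit-reset": "1000"},
    now=400.0,
)
print(wait)
```

The connector could sleep for the returned duration between pages, which bounds the run by the real window rather than by a number taken from the documentation.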

To get the whole history and still let the connector run normally, the following should be implemented:

JoseAnaya28 commented 1 week ago

For the promoted metrics, there's a section in this link on a sample request.

curl 'https://api.twitter.com/2/tweets/1204084171334832128?tweet.fields=non_public_metrics,organic_metrics&media.fields=non_public_metrics,organic_metrics&expansions=attachments.media_keys' \
  --header 'authorization: OAuth oauth_consumer_key="CONSUMER_API_KEY", oauth_nonce="OAUTH_NONCE", oauth_signature="OAUTH_SIGNATURE", oauth_signature_method="HMAC-SHA1", oauth_timestamp="OAUTH_TIMESTAMP", oauth_token="ACCESS_TOKEN", oauth_version="1.0"'
apentori commented 1 week ago

Error when trying to implement the start_time limit:

Caused by: io.temporal.failure.ApplicationFailure: message='Integration failed to output a spec struct and did not output a failure reason', type='io.airbyte.workers.exception.WorkerException', nonRetryable=false
    at io.airbyte.workers.WorkerUtils.throwWorkerException(WorkerUtils.java:269) ~[io.airbyte-airbyte-commons-worker-0.55.0.jar:?]
    at io.airbyte.workers.general.DefaultGetSpecWorker.run(DefaultGetSpecWorker.java:78) ~[io.airbyte-airbyte-commons-worker-0.55.0.jar:?]
    at io.airbyte.workers.general.DefaultGetSpecWorker.run(DefaultGetSpecWorker.java:36) ~[io.airbyte-airbyte-commons-worker-0.55.0.jar:?]
    at io.airbyte.workers.temporal.TemporalAttemptExecution.get(TemporalAttemptExecution.java:142) ~[io.airbyte-airbyte-workers-0.55.0.jar:?]
    at io.airbyte.workers.temporal.spec.SpecActivityImpl.lambda$run$2(SpecActivityImpl.java:179) ~[io.airbyte-airbyte-workers-0.55.0.jar:?]
    at io.airbyte.commons.temporal.HeartbeatUtils.withBackgroundHeartbeat(HeartbeatUtils.java:57) ~[io.airbyte-airbyte-commons-temporal-core-0.55.0.jar:?]
    at io.airbyte.workers.temporal.spec.SpecActivityImpl.run(SpecActivityImpl.java:163) ~[io.airbyte-airbyte-workers-0.55.0.jar:?]
    at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103) ~[?:?]
    at java.base/java.lang.reflect.Method.invoke(Method.java:580) ~[?:?]
    at io.temporal.internal.activity.RootActivityInboundCallsInterceptor$POJOActivityInboundCallsInterceptor.executeActivity(RootActivityInboundCallsInterceptor.java:64) ~[temporal-sdk-1.22.3.jar:?]
    at io.temporal.internal.activity.RootActivityInboundCallsInterceptor.execute(RootActivityInboundCallsInterceptor.java:43) ~[temporal-sdk-1.22.3.jar:?]
    at io.temporal.internal.activity.ActivityTaskExecutors$BaseActivityTaskExecutor.execute(ActivityTaskExecutors.java:107) ~[temporal-sdk-1.22.3.jar:?]
    at io.temporal.internal.activity.ActivityTaskHandlerImpl.handle(ActivityTaskHandlerImpl.java:124) ~[temporal-sdk-1.22.3.jar:?]
    at io.temporal.internal.worker.ActivityWorker$TaskHandlerImpl.handleActivity(ActivityWorker.java:278) ~[temporal-sdk-1.22.3.jar:?]
    at io.temporal.internal.worker.ActivityWorker$TaskHandlerImpl.handle(ActivityWorker.java:243) ~[temporal-sdk-1.22.3.jar:?]
    at io.temporal.internal.worker.ActivityWorker$TaskHandlerImpl.handle(ActivityWorker.java:216) ~[temporal-sdk-1.22.3.jar:?]
    at io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:105) ~[temporal-sdk-1.22.3.jar:?]
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
    at java.base/java.lang.Thread.run(Thread.java:1583) ~[?:?]

The connector doesn't like using the strftime function to convert the date to a string.

"start_time": self.start_time.strftime("%Y-%m-%d%H:%M:%SZ")}
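Two things stand out in that line, though neither is confirmed here as the cause of the spec failure. The format string has no T between the date and the time, while the endpoint documents start_time as YYYY-MM-DDTHH:mm:ssZ. And if self.start_time arrives from the connector config as a string rather than a datetime, calling strftime on it would raise. A minimal sketch that tolerates both cases, with to_api_timestamp being a hypothetical helper name:

```python
from datetime import datetime, timezone
from typing import Union


def to_api_timestamp(value: Union[str, datetime]) -> str:
    """Format a config value as YYYY-MM-DDTHH:mm:ssZ in UTC."""
    if isinstance(value, str):
        # Assumption: the config holds an ISO 8601 string; older Python
        # versions do not parse a trailing "Z", so map it to an offset.
        value = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if value.tzinfo is None:
        # Assumption: naive datetimes are meant as UTC.
        value = value.replace(tzinfo=timezone.utc)
    return value.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


from_string = to_api_timestamp("2024-01-01T00:00:00Z")
from_datetime = to_api_timestamp(datetime(2024, 5, 6, 7, 8, 9))
print(from_string, from_datetime)
```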
apentori commented 1 week ago

Letting the connector run without a time limit didn't fetch the whole tweet history. For example, for the account statuseth, we only got 27 tweets before the endpoint returned an empty response:

 {'meta': {'next_token': '7140dibdnow9c7btw482mq8sqdarz7m9kb6zo92i8wo51', 'previous_token': '77qpymm88g5h9vqkluxex6at4ibn4hol7dahowoza9u0g', 'result_count': 0}}
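Note that this empty response still carries a next_token, so a loop that only checks for the token would keep requesting pages. A small sketch of a stop condition that also looks at result_count; should_continue is a hypothetical helper, not something the connector is known to contain.

```python
from typing import Dict


def should_continue(page: Dict) -> bool:
    """Keep paginating only if the page had results and offers a next token."""
    meta = page.get("meta", {})
    return bool(meta.get("next_token")) and meta.get("result_count", 0) > 0


# The empty response shown above.
last_page = {'meta': {'next_token': '7140dibdnow9c7btw482mq8sqdarz7m9kb6zo92i8wo51',
                      'previous_token': '77qpymm88g5h9vqkluxex6at4ibn4hol7dahowoza9u0g',
                      'result_count': 0}}
keep_going = should_continue(last_page)
print(keep_going)  # False
```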
apentori commented 1 week ago

Returns Tweets composed by a single user, specified by the requested user ID. By default, the most recent ten Tweets are returned per request. Using pagination, the most recent 3,200 Tweets can be retrieved.

According to the documentation, we should get more than 27...

https://developer.x.com/en/docs/twitter-api/tweet-caps

Basic tier 10,000 Posts per month

If the same Post is returned from multiple queries during a day, then the Post is only counted once against the post cap - i.e, the Posts are deduplicated.

The tweet caps should limit use to this small amount
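To put the quoted numbers side by side, here is a rough budget calculation. It uses the 3,200-tweet history limit and the 10,000-post Basic cap quoted above; the page size of 100 is an assumption (the maximum max_results for this endpoint), and the calculation ignores deduplication, which per the quote only applies within a day.

```python
import math

POST_CAP_PER_MONTH = 10_000      # Basic tier cap quoted above
MAX_HISTORY_PER_ACCOUNT = 3_200  # most recent tweets reachable via pagination
PAGE_SIZE = 100                  # assumption: max_results set to its maximum

# Requests needed to walk one account's full reachable history.
requests_per_account = math.ceil(MAX_HISTORY_PER_ACCOUNT / PAGE_SIZE)

# How many full 3,200-tweet histories fit in one month's post cap.
full_histories_per_month = POST_CAP_PER_MONTH // MAX_HISTORY_PER_ACCOUNT

print(requests_per_account, full_histories_per_month)
```

Under these assumptions a full history costs 32 requests per account, and the monthly cap covers about three complete histories, far more than the 27 tweets actually returned.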