singer-io / tap-closeio

Tap for Close.io
GNU Affero General Public License v3.0
7 stars 10 forks source link

Activities of type Meeting impact the bookmark and stop extraction #20

Open tconbeer opened 4 years ago

tconbeer commented 4 years ago

It's possible for an activity to be created when a Meeting is set in CloseIO. That activity has a date_created that is equal to the scheduled date of the meeting, which is likely in the future. These records also have a status of upcoming and a source of calendar.

The Close API probably shouldn't work this way, but if that won't change, we need to update the tap to either use a different timestamp (there is a date_updated field that is probably better anyway) or special-case these Meeting activities to exclude them from either the sync or the bookmark calculation (which should be a simple comparison of record in the method here)

cc @apbenn

tconbeer commented 3 years ago

If I open a PR against this issue, would it be accepted? This tap seems unmaintained.

pmauro commented 3 years ago

Why aren't we using the date_updated field for the activities table\Stream? This is the only Stream that uses date_updated instead of date_created as its BOOK_KEY (line 29 of tap-closeio/streams.py). I'd make the change and issue a PR but others may be better suited to test the fix. (I'm also curious why the BOOK_KEY for activities is the only one that doesn't use date_updated. This seems intentional.)

pmauro commented 3 years ago

It looks like @cosimon tried to fix this:

https://github.com/singer-io/tap-closeio/compare/fix/ignore-future-dated-activities

I still think that keying on date_updated is more correct, but both changes should probably be pulled.

pmauro commented 3 years ago

More digging. @cdlethem committed the change I'm suggesting to his/her fork. We could just pull request from that if the QA's been done...

https://github.com/singer-io/tap-closeio/compare/singer-io:master...cdlethem:master

cdlethem commented 3 years ago

For what it's worth, I couldn't find a reason that activities used the date_created field as a key, perhaps the author had in mind activities like opportunity_status_changed which wouldn't ever be altered. I chose to change it to date_updated instead of filtering out future activities, because for my use case, I am concerned with how many future meetings a user has scheduled, for instance.

pmauro commented 3 years ago

Ha. Well the future meetings -- this is redundant in 99% of cases -- are what are killing me. It breaks the ETL.

rahimnathwani commented 3 years ago

@cdlethem "I couldn't find a reason that activities used the date_created field as a key"

I was curious about this so looked at the paginated_sync() function. It looks like:

  1. date_created is being used to filter activities we get from the endpoint, but
  2. we're not passing date_created as a parameter to the Close API

Is #2 correct? If so, it seems like (i) the README's rationale for not using date_created doesn't make sense, and (ii) you are right, in that we could just change that to any other field that exists in the response objects.

Have I understood that correctly, that the tap is requesting all activities ever, even when doing an incremental sync for the last day?