near / data-platform

Near.org Data Platform Initiatives and Epics: Recommendations, Search, Analaytics
1 stars 0 forks source link

Developer Contributions Stats Pipeline #24

Closed pkudinov closed 9 months ago

pkudinov commented 1 year ago

Description

Create a contribution stats pipeline for the Developer Contribution Board.


Data Needed

Value Detail
Name User's display name
Account ID User's account ID
Total Components Total number of created components during selected timeframe
Total Stars Total number of stars their components received during selected timeframe
Total Forks Total number of forks their components had during selected timeframe
Total Commits Total number of published code events during selected timeframe
Total Lines of Code Total lines of code shipped during selected timeframe
Start Date The date a user shipped their first line of BOS code

Resources

thisisjoshford commented 10 months ago

I spoke with @Oksanusha the week before last about removing forks from this pipeline and just updated the ticket to strike out that field. Is this issue complete?

jinouyang commented 10 months ago

@Oksanusha what's the latest update on this?

Oksanusha commented 10 months ago

@jinouyang @thisisjoshford Hi guys! This is the last update. Please notice that the tag column is still in progress because, for some widgets, my code doesn't show the actual tags.

Also, there is an issue with stars. For some widgets, the number of total stars needs to be corrected.

s-n-park commented 10 months ago

@Oksanusha - heads up I think these 2 queries should be able to help you along:

  1. Commits - I think this can actually be simplified considerably to count number of times that the widget field is not null: Query - using this example

  2. Tag keys - more complicated but essentially I think a decent method is to search through each of the widget keys specifically for the path "...:metadata:tags" and then pull json_object_keys to get the array of tags (see query) using Neat as an example

I used a lot of your logic and just played around with a few examples, let me know if I can clarify anything!

Oksanusha commented 10 months ago

Hi @s-n-park! Thanks for helping! I used these kind of queries, but it still has two issues: the number of commits doesn't match for some widgets (I found that our internal data has more commits than the commit's history on the website), and it's not easy to build a query to show an actual list of tags for every date_utc, because some tags can be added and removed several times and other tags used only in one tx several weeks ago and they didn't reflect in the last commit, so again it doesn't match with our website.

Oksanusha commented 10 months ago

Perhaps the tags task can be resolved using udf

s-n-park commented 10 months ago

@Oksanusha - looks like the commit difference might be related to failed trxns

Here's an example transaction where this seems to happen which matches up with one of the discrepancies you noted on 10/1/23

Image

~~UPDATE

Was able to get to the 254 down to the corrected 249 commits when excluding failed trxns explicitly: sample query

s-n-park commented 10 months ago

@Oksanusha - the tag logic is again much more complicated but somewhat similar to the star set logic:

Everytime the list of tags show up, it must be broken down into the key/value pairs. If the value is set to null, it appears that this is equivalent to removing that tag. I took the specific example you called out and was able to demonstrate how bridge/lending, and dexes show up because polygon-zkevm, and liquidity manager have most recently had their values set to null

Image

investigation query

Off the top of my head, the best psuedo logic for getting this would probably be to window function where you partition by widget + tag keys and then only take the row_num=1 for when the latest value is not null.

Oksanusha commented 10 months ago

@s-n-park Fantastic! Thank you so much!!!

Oksanusha commented 9 months ago

Hey there! The pipeline is done. I compared developers and widget information with widget data on near.org; randomly chosen widgets have identical data (number of commits, stars, unique dependencies, tags, discussions, website link)

The gold_near_social_widgets table shows widget information for every tx.

The gold_daily_component_stats table shows aggregated information by every widget daily.

Gold_near_social_daily_weekly_monthly_new_developers, gold_near_social_monthly_top_10_code_line_writers, and gold_near_social_retention_rate tables stayed the same.