Closed pkudinov closed 9 months ago
I spoke with @Oksanusha the week before last about removing forks from this pipeline and just updated the ticket to strike out that field. Is this issue complete?
@Oksanusha what's the latest update on this?
@jinouyang @thisisjoshford Hi guys! This is the last update. Please notice that the tag column is still in progress because, for some widgets, my code doesn't show the actual tags.
Also, there is an issue with stars. For some widgets, the number of total stars needs to be corrected.
@Oksanusha - heads up I think these 2 queries should be able to help you along:
Commits - I think this can actually be simplified considerably to count number of times that the widget field is not null: Query - using this example
Tag keys - more complicated but essentially I think a decent method is to search through each of the widget keys specifically for the path "...:metadata:tags" and then pull json_object_keys to get the array of tags (see query) using Neat as an example
I used a lot of your logic and just played around with a few examples, let me know if I can clarify anything!
Hi @s-n-park! Thanks for helping! I used these kind of queries, but it still has two issues: the number of commits doesn't match for some widgets (I found that our internal data has more commits than the commit's history on the website), and it's not easy to build a query to show an actual list of tags for every date_utc, because some tags can be added and removed several times and other tags used only in one tx several weeks ago and they didn't reflect in the last commit, so again it doesn't match with our website.
Perhaps the tags task can be resolved using udf
@Oksanusha - looks like the commit difference might be related to failed trxns
Here's an example transaction where this seems to happen which matches up with one of the discrepancies you noted on 10/1/23
~~UPDATE
Was able to get to the 254 down to the corrected 249 commits when excluding failed trxns explicitly: sample query
@Oksanusha - the tag logic is again much more complicated but somewhat similar to the star set logic:
Everytime the list of tags show up, it must be broken down into the key/value pairs. If the value is set to null, it appears that this is equivalent to removing that tag. I took the specific example you called out and was able to demonstrate how bridge/lending, and dexes show up because polygon-zkevm, and liquidity manager have most recently had their values set to null
Off the top of my head, the best psuedo logic for getting this would probably be to window function where you partition by widget + tag keys and then only take the row_num=1 for when the latest value is not null.
@s-n-park Fantastic! Thank you so much!!!
Hey there! The pipeline is done. I compared developers and widget information with widget data on near.org; randomly chosen widgets have identical data (number of commits, stars, unique dependencies, tags, discussions, website link)
The gold_near_social_widgets table shows widget information for every tx.
The gold_daily_component_stats table shows aggregated information by every widget daily.
Gold_near_social_daily_weekly_monthly_new_developers, gold_near_social_monthly_top_10_code_line_writers, and gold_near_social_retention_rate tables stayed the same.
Description
Create a contribution stats pipeline for the Developer Contribution Board.
Data Needed
Total ForksTotal number of forks their components had during selected timeframeThe pipeline should enable the above table to be created, for all time, this month == last 30 days, and last week == last 7 days
Resources
UX Figma Design
Data can be based on
hive_metastore.near_social_developers_analytics.gold_daily_component_stats
table