mozilla / ActiveData-ETL

The ETL process responsible for filling ActiveData
Mozilla Public License 2.0

Use firefox-files to map source.file.name to repo filenames #33

Open klahnakoski opened 6 years ago

klahnakoski commented 6 years ago

The jsdcov filenames are wrong. The files are loaded dynamically and given a variety of path prefixes:

obj-firefox/dist/include/mozilla/dom/CommentBinding.h
file:///builds/worker/workspace/build/tests/jsreftest/tests/shell.js
chrome://global/content/browser-content.js

Use the firefox-files table in ActiveData to map these filenames to the matching versions found in the repo. I imagine a best-matching-suffix search will do nicely; a sketch follows below. It is important that this mapping be very fast.
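A minimal sketch of that suffix search, assuming the repo file list is already in hand as a flat list of paths; all names below are illustrative and not taken from the actual codebase:

```python
def build_suffix_index(repo_files):
    """Index repo paths by basename so candidate lookup is cheap.

    repo_files: flat list of repo-relative paths, e.g. from firefox-files.
    """
    index = {}
    for path in repo_files:
        index.setdefault(path.split("/")[-1], []).append(path)
    return index


def best_matching_suffix(raw_name, index):
    """Return the repo path that shares the longest run of trailing path
    segments with raw_name, or None if even the basename is unknown."""
    # Prefixes like file:/// or chrome:// fall away naturally because we
    # compare from the right-hand end of the path.
    segments = raw_name.replace("\\", "/").split("/")
    candidates = index.get(segments[-1])
    if not candidates:
        return None

    def shared_suffix_len(candidate):
        parts = candidate.split("/")
        n = 0
        while n < len(parts) and n < len(segments) and parts[-1 - n] == segments[-1 - n]:
            n += 1
        return n

    return max(candidates, key=shared_suffix_len)


# Hypothetical example: map a chrome:// URL back to a repo path.
index = build_suffix_index(["toolkit/content/browser-content.js"])
print(best_matching_suffix("chrome://global/content/browser-content.js", index))
```

Keying the index by basename keeps each lookup roughly constant in the number of repo files, which should satisfy the speed requirement.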

Please update jsvm_to_es to use this mapping.

klahnakoski commented 6 years ago

ActiveData has a download limit of 50K records, so getting the full file list that way may be more trouble than it is worth. Instead, find the original artifact and download it.

Here is the firefox-files table: http://activedata.allizom.org/tools/query.html#query_id=lGdtmxm+

The etl property has details about how each record traveled through the ETL pipeline, including the sources used to populate it: http://activedata.allizom.org/tools/query.html#query_id=he3n3mcq

We can see recent artifacts with a query to the task table: http://activedata.allizom.org/tools/query.html#query_id=F+y2Bj11
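For completeness, a sketch of issuing such a query over ActiveData's HTTP interface; the endpoint and query shape follow its JSON query language, but the selected columns are assumptions for illustration:

```python
import json
import urllib.request

# A small query against the task table; "format": "list" asks for one
# JSON object per record. The column names here are illustrative.
query = {
    "from": "task",
    "select": ["task.id", "task.artifacts.name"],
    "where": {"eq": {"repo.branch.name": "mozilla-central"}},
    "limit": 10,
    "format": "list",
}

request = urllib.request.Request(
    "http://activedata.allizom.org/query",
    data=json.dumps(query).encode("utf8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    for record in json.load(response)["data"]:
        print(record)
```

The 50K record limit mentioned above applies to these queries too, which is why downloading the original artifact is the suggested route for the full file list.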

marco-c commented 6 years ago

There is an LCOV rewriter that was written for this purpose; it applies the most precise mapping we can get.

klahnakoski commented 6 years ago

The LCOV rewriter requires a local copy of the hg repository, yes? If this is to be done by ActiveData's ETL pipeline, it would be preferable to have fewer dependencies.

I believe we discussed that this rewriting could be done as yet another post-collection step on the TC machines. Then the artifacts would have the correct filenames before they enter the pipeline. If it is done on TC, it should not lose the test-level detail we plan to track: https://treeherder.mozilla.org/#/jobs?repo=try&revision=182b519971d39b8c66c6faf61f62b87973d81608

I believe it is best that the translation is done in the ETL pipeline: it keeps the data translation in one place, the translation has fewer code dependencies, and the coverage generation stays as simple as possible.

klahnakoski commented 6 years ago

Sorry, my last link was not right. Here is the job that generates per-test artifacts: https://treeherder.mozilla.org/#/jobs?repo=try&revision=182b519971d39b8c66c6faf61f62b87973d81608&selectedJob=145407889

marco-c commented 6 years ago

> The LCOV rewriter requires a local copy of the hg repository, yes? If this is to be done by ActiveData's ETL pipeline, it would be preferable to have fewer dependencies.

> I believe we discussed that this rewriting could be done as yet another post-collection step on the TC machines. Then the artifacts would have the correct filenames before they enter the pipeline. If it is done on TC, it should not lose the test-level detail we plan to track: https://treeherder.mozilla.org/#/jobs?repo=try&revision=182b519971d39b8c66c6faf61f62b87973d81608

Yes, the mapping should work whether we run all tests or only a subset of them.

> I believe it is best that the translation is done in the ETL pipeline: it keeps the data translation in one place, the translation has fewer code dependencies, and the coverage generation stays as simple as possible.

We'd still translate in the ETL pipeline, but we'd translate using the mapping generated by the coverage task (uploaded as an artifact).
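Under that plan the ETL side could stay very small. A sketch, where the artifact path and the mapping's JSON shape (a flat raw-name-to-repo-path object) are both assumptions, since the coverage task does not produce this artifact yet:

```python
import json
import urllib.request


def load_mapping(task_id):
    # Hypothetical artifact name; the real one would be whatever the
    # coverage task decides to upload.
    url = (
        "https://queue.taskcluster.net/v1/task/%s/artifacts/"
        "public/code-coverage/filename-mapping.json" % task_id
    )
    with urllib.request.urlopen(url) as response:
        return json.load(response)  # assumed shape: {raw_name: repo_path}


def rewrite_source_file_name(record, mapping):
    """Swap source.file.name for its repo path; leave unmapped names
    untouched so no coverage records are dropped."""
    raw = record["source"]["file"]["name"]
    record["source"]["file"]["name"] = mapping.get(raw, raw)
    return record
```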