tweaselORG / meta

(Currently) only used for the issue tracker.

Implement HAR to PDF library #42

Closed baltpeter closed 7 months ago

baltpeter commented 7 months ago

As explained in https://github.com/tweaselORG/meta/issues/41#issuecomment-1833351873

baltpeter commented 7 months ago

Typst.ts has two main approaches for how it can be used: by importing an "all-in-one" bundle that does a whole bunch of plumbing for you, or by manually doing the necessary work yourself.

With the AIO bundle, compiling a string to PDF is easy:

import { $typst } from '@myriaddreamin/typst.ts/dist/esm/contrib/snippet.mjs';

export const compile = async () => {
    const source = `
= Introduction
#lorem(60)
`;

    return await $typst.pdf({ mainContent: source });
};

However, the maintainer recommends going the manual route. I struggled a bit to get that working, so I wanted to document what I found out in case anyone else has the same problem. While the code above worked just fine for me, I always got an entirely blank PDF when I tried to recreate it manually. I ended up duplicating snippet.mts locally and removing stuff until I finally found the problem. The one difference between my code and snippet.mts was this:

https://github.com/Myriad-Dreamin/typst.ts/blob/9644b7088696a0d2fb14b5ee664f855270c78fd5/packages/typst.ts/src/contrib/snippet.mts#L534

This ensures (quite opaquely) that ccOptions.beforeBuild is not undefined. ccOptions here is what gets passed to cc.init(). I wasn't doing that: I didn't pass an argument at all, since it isn't required. I didn't investigate this more thoroughly, but from a quick glance, not having beforeBuild seems to cause fonts not to be loaded (https://github.com/Myriad-Dreamin/typst.ts/blob/9644b7088696a0d2fb14b5ee664f855270c78fd5/packages/typst.ts/src/compiler.mts#L162-L186).

Anyway, here is a working minimal example of how to render a PDF with typst.ts using the manual approach:

import { createTypstCompiler } from '@myriaddreamin/typst.ts/dist/esm/compiler.mjs';

export const compile = async () => {
    const mainFilePath = '/main.typ';
    const source = `
= Introduction
#lorem(60)
`;

    const cc = createTypstCompiler();
    await cc.init({ beforeBuild: [] });

    cc.addSource(mainFilePath, source);

    return await cc.compile({ mainFilePath, format: 'pdf' });
};

baltpeter commented 7 months ago

In my legal research (https://github.com/tweaselORG/meta/issues/38), I found one example where the Austrian DPA "printed" a HAR in one of their decisions: https://noyb.eu/sites/default/files/2023-03/Bescheid%20redacted.pdf#page=24

I'll use that as a rough example of how to structure the output.

Other than that, I'll reference Firefox's and Chrome's dev tools.

baltpeter commented 7 months ago

Unfortunately, I'm running into a fairly major issue with Typst quite early on. I'm currently working on defining a template for how requests should be rendered.

The problem is that we are regularly dealing with very long non-word strings and Typst doesn't handle those well. By default, if it can't find a nice place to break, it will just let them overflow:

image

Now, you can enable hyphenation:

#set text(hyphenate: true)

https://graph.facebook.com/v14.0/app?access_token=138566025676|e251d7dad1e2b26389ad8a43557aa256&fields=supports_implicit_sdk_logging,gdpv4_nux_content,gdpv4_nux_enabled,android_dialog_configs,android_sdk_error_categories,app_events_session_timeout,app_events_feature_bitmask,auto_event_mapping_android,seamless_login,smart_login_bookmark_icon_url,smart_login_menu_icon_url,restrictive_data_filter_params,aam_rules,suggested_events_setting&format=json&sdk=android

image

This is better, but the hyphens are very unfortunate, since they may well change the meaning (cf. https://github.com/typst/typst/issues/674). I may be missing something, but I haven't found a way to enable hyphenation while hiding the actual hyphens.

Worse yet, hyphenation doesn't appear to work in raw/code blocks:

#set text(hyphenate: true)

https://graph.facebook.com/v14.0/app?access_token=138566025676|e251d7dad1e2b26389ad8a43557aa256&fields=supports_implicit_sdk_logging,gdpv4_nux_content,gdpv4_nux_enabled,android_dialog_configs,android_sdk_error_categories,app_events_session_timeout,app_events_feature_bitmask,auto_event_mapping_android,seamless_login,smart_login_bookmark_icon_url,smart_login_menu_icon_url,restrictive_data_filter_params,aam_rules,suggested_events_setting&format=json&sdk=android

image

#set raw(hyphenate: true) produces an error.

https://github.com/typst/typst/issues/1271#issuecomment-1575580135 suggests appending a zero-width space after each character, which is easy enough to do:

#show raw: t => for c in t.text [#c.replace(c, c + sym.zws)]

https://graph.facebook.com/v14.0/app?access_token=138566025676|e251d7dad1e2b26389ad8a43557aa256&fields=supports_implicit_sdk_logging,gdpv4_nux_content,gdpv4_nux_enabled,android_dialog_configs,android_sdk_error_categories,app_events_session_timeout,app_events_feature_bitmask,auto_event_mapping_android,seamless_login,smart_login_bookmark_icon_url,smart_login_menu_icon_url,restrictive_data_filter_params,aam_rules,suggested_events_setting&format=json&sdk=android

image

This works and produces a passable result, and it would be fine if we were producing PDFs for print only. However, if you now copy this text, you also copy all those zero-width spaces, which is just wrong.

And interestingly, these zero-width spaces are rendered with a very much non-zero width when copied into LibreOffice or gedit:

image
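
To make the copy problem concrete, here is a minimal sketch of the same padding done on the TypeScript side instead of in a Typst show rule. The helper names are hypothetical and not taken from the actual implementation. One difference to keep in mind: the Typst loop iterates over grapheme clusters, while Array.from iterates over code points, so the two only behave identically for simple text such as ASCII URLs.

```typescript
const ZWS = '\u200b';

// Hypothetical helper: append a zero-width space after every code point,
// mirroring what the Typst show rule above does for raw text.
const addBreakOpportunities = (text: string): string =>
    Array.from(text, (c) => c + ZWS).join('');

// Hypothetical helper: remove the zero-width spaces again, which is what
// anyone pasting text copied from the PDF would have to do themselves.
const stripBreakOpportunities = (text: string): string => text.split(ZWS).join('');

const exampleUrl = 'https://graph.facebook.com/v14.0/app?format=json&sdk=android';
const padded = addBreakOpportunities(exampleUrl);
```

For an ASCII string, the padded version is exactly twice as long as the original, which is why pasted text looks so different from what is visible on the page.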

baltpeter commented 7 months ago

The same problem applies when using a soft hyphen instead of a zws, except now we get hyphens instead of hard breaks (plus an unnecessary hyphen at the end that I don't quite understand, but anyway):

#show raw: t => for c in t.text [#c.replace(c, c + sym.hyph.soft)]

https://graph.facebook.com/v14.0/app?access_token=138566025676|e251d7dad1e2b26389ad8a43557aa256&fields=supports_implicit_sdk_logging,gdpv4_nux_content,gdpv4_nux_enabled,android_dialog_configs,android_sdk_error_categories,app_events_session_timeout,app_events_feature_bitmask,auto_event_mapping_android,seamless_login,smart_login_bookmark_icon_url,smart_login_menu_icon_url,restrictive_data_filter_params,aam_rules,suggested_events_setting&format=json&sdk=android

image

baltpeter commented 7 months ago

Okay, having had a little time to ponder this over the weekend and considering that there aren't really any other good alternatives to Typst here (https://github.com/tweaselORG/meta/issues/41#issuecomment-1833490234), I guess I am just going to accept this for the moment in the interest of moving along. For now, I'll implement the workaround and we'll have to hope that Typst fixes that sooner rather than later.

baltpeter commented 7 months ago

Here's my current draft for what the export will look like (mashed together from various requests): https://typst.app/project/rOUJm_yR0ubv0bgmgM4rsn

baltpeter commented 7 months ago

I stumbled across annoying differences in how different HAR implementations encode POST data: https://github.com/tweaselORG/TrackHAR/issues/58#issuecomment-1838215940

Based on that investigation, I think we can use the following heuristic: We display params if it is set and non-empty, and text otherwise. In all the experiments, if params was set, either text was empty or params was a parsed version of text (and thus more "useful").
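
A minimal sketch of that heuristic could look like the following. The type and function names are made up for illustration, and the types only model the few postData fields from the HAR format that matter here.

```typescript
// Minimal subset of a HAR `postData` object; only the fields used here.
type HarParam = { name: string; value?: string };
type HarPostData = { mimeType?: string; params?: HarParam[]; text?: string };

// What the renderer should show for a request body (hypothetical shape).
type PostDataDisplay =
    | { kind: 'params'; params: HarParam[] }
    | { kind: 'text'; text: string };

// Prefer `params` if it is set and non-empty, fall back to `text` otherwise.
const postDataToDisplay = (postData: HarPostData): PostDataDisplay =>
    postData.params && postData.params.length > 0
        ? { kind: 'params', params: postData.params }
        : { kind: 'text', text: postData.text ?? '' };
```

Returning a tagged union instead of a plain string leaves it to the template to decide how to lay out parsed parameters versus an opaque body.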

baltpeter commented 7 months ago

I've got a first prototype of this working. Here's the output for a traffic capture from the Airbnb app on Android: PDF

Two more things I need to implement before I think we can consider this good enough for now:

baltpeter commented 7 months ago

Ok, for escaping, as far as I can tell (https://github.com/typst/typst/issues/266), the best way would be to throw everything that could potentially contain Typst syntax into raw blocks. (The issue mentions that you can call .text on them but I think we want to render all instances as monospace anyway, so we shouldn't need that.)

As per the documentation: "Within raw blocks, everything (except for the language tag, if applicable) is rendered as is, in particular, there are no escape sequences." So, this is perfectly fine, for example:

`#set heading(numbering: "")`

image

There is only one issue: What if there are backticks in our string?

Since there are no escape sequences in raw blocks at all, `\`` isn't going to work. But, as per the example in the docs, you can do this:

abc``` `backticks` ```def

The leading and trailing space will be trimmed, yielding:

image

This works for up to two consecutive backticks in the string. But what if there are three or more? That breaks again and I didn't find any documentation or issue advising on what to do in that case.

However, I found that we can use our zws hack again: There is only a problem for consecutive backticks. So, if we prepend a zero-width space to every backtick, we're fine again:

``` test`​``test ```

image
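
Put together, the escaping described above can be sketched as a small helper on the TypeScript side. The function name is hypothetical, and the sketch only covers the backtick handling discussed here; it assumes single-line strings and has not been checked against multi-line input, where Typst's trimming rules for raw blocks may differ.

```typescript
const ZERO_WIDTH_SPACE = '\u200b';

// Hypothetical helper: wrap arbitrary text in a Typst raw block.
// - Every backtick gets a zero-width space prepended, so the text can never
//   contain a run of consecutive backticks that would close the block early.
// - The space after the opening fence keeps the text from being read as a
//   language tag; the space before the closing fence separates a trailing
//   backtick from the fence.
const toTypstRaw = (text: string): string =>
    '``` ' + text.split('`').join(ZERO_WIDTH_SPACE + '`') + ' ```';

const escapedExample = toTypstRaw('test```test');
```

The same caveat as for the line-breaking workaround applies: the zero-width spaces end up in the PDF and are copied along with the text.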

baltpeter commented 7 months ago

Yup, escaping seems to work fine now. I've created an evil.har that tries its darndest to break things…

…but doesn't succeed at doing that anymore: PDF

baltpeter commented 7 months ago

Having had a quick look at the available options for translation libraries, these all seem waaay too complicated for this project. We don't even need the (few) features preact-i18n offers! Guess I'll just roll my own. :D
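
For a sense of how little is needed, a hand-rolled lookup might be as small as the following sketch. The languages, keys, and strings are invented for illustration and are not the ones used in the project.

```typescript
// Hypothetical translation table: one flat object of strings per language.
const translations = {
    en: { request: 'Request', headers: 'Headers' },
    de: { request: 'Anfrage', headers: 'Header' },
} as const;

type Language = keyof typeof translations;
type TranslationKey = keyof (typeof translations)['en'];

// Look up a string; the types reject unknown languages and keys at compile time.
const translate = (language: Language, key: TranslationKey): string =>
    translations[language][key];
```

With the key type derived from the English table, a typo in a key is caught by the compiler rather than showing up as a missing string in the PDF.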

baltpeter commented 7 months ago

The translations I'm coming up with seem utterly ridiculous. But I did compare mine to Firefox's and they are just as ridiculous (well, Firefox just straight up doesn't have translations for many dev tools strings, but the ones that are translated, anyway). shrug

baltpeter commented 7 months ago

Alright, I'll consider this done now.

In the interest of making progress with the actual project and since I could well imagine that we'll iterate on the output some more in the future, I won't release this as a standalone library yet but just keep the code in the complaint generator for now.

I've opened https://github.com/tweaselORG/meta/issues/43 to remind us to release a library in the future, though.