q-m / scrapyd-k8s

Scrapyd on container infrastructure
MIT License

More persistent logs #28

Open wvengen opened 6 months ago

wvengen commented 6 months ago

Currently, Docker / Kubernetes logs are used for logging. This is sometimes good enough, but in many situations it is not. These logs are often truncated at night (and potentially more often when they grow large) - especially on Kubernetes - so inspecting errors of long-running jobs can be difficult.

Find a way to keep logs around for at least a bit longer, e.g. for the lifetime of the job (as that could be configurable). The focus is on Kubernetes, where this is the most pressing issue.

Note that if you have a large, mature Kubernetes cluster, it likely already includes components to handle logs. But for smaller clusters, such a logging stack brings a lot of overhead, and something else is desired.

Either Kubernetes offers some mechanism to keep logs around for longer, or logs need to be stored elsewhere (and also cleaned up by some system).

wvengen commented 6 months ago

A more complex solution (to set up) could be something like https://kube-logging.dev/ that forwards logs to a central place and can store them on object storage. This looks like it's meant to be run once for the cluster (using daemonsets), so maybe not ideal here.

wvengen commented 6 months ago

A simple solution would be that the spider saves logs locally to node-local storage, and pushes them to object storage when the spider is terminated (also when evicted, for example). Then the logs endpoint (#12) needs to have knowledge of it.

wvengen commented 6 months ago

Something like scrapy-logexport could be a simple solution. Ideally scrapyd-k8s could recognize this, find the S3 location and credentials, and be able to serve the logs.

vlerkin commented 5 months ago

@wvengen I discussed the persistent-logs problem with an experienced colleague; he is familiar with the Questionmark project from one of the Pythoneers days we had in the past. He suggested that an interesting and suitable solution for us might be storing logs in a persistent volume. We can create a persistent volume and mount it in all our pods; they will write their own log files there, and even if the pods die, the logs will still stay on the persistent volume.

I believe this overlaps with your second idea in the comments on this issue.

https://stackoverflow.com/questions/63479814/how-to-store-my-pod-logs-in-a-persistent-storage

What do you think about it?

wvengen commented 5 months ago

Thank you for looking into this! Storing logs in a persistent volume is an interesting idea. Some things that come to mind:

vlerkin commented 5 months ago

Thank you for your thoughts, I will use them as guidance to look deeper into this possible solution. I'm still learning a lot about k8s and can't answer right away :)

vlerkin commented 5 months ago

Can we use the same persistent volume for all pods?

Yes, we can, according to the solution provided on Stack Overflow and the docs. "A PersistentVolume (PV) is a piece of storage in the cluster that has been provisioned by an administrator or dynamically provisioned using Storage Classes. It is a resource in the cluster just like a node is a cluster resource." So we first create a volume in the cluster and then claim a piece of it for each deployment.

We'd need a method to cleanup old logs.

Yes, good point.

Volume storage is more expensive than object storage.

Depends on the solution we choose. If I understand correctly, by object storage you mean something like distributed storage, e.g. S3 on AWS? Not something from the k8s world, right? If yes, we can rotate logs from the volume to object storage.

Jobs may run for longer than a day, so storing the log after the spider finishes may result in lost parts.

This one is not a problem if we use streaming, but if we use streaming, do we even need a volume? Why not just stream logs from k8s to object storage? What is better for us? Cheaper? If we don't want to implement streaming, there should be a mechanism to rotate logs after the job is done; I need to read a bit more to understand how to implement something like this.

wvengen commented 4 months ago

Curious to hear more about how persistent volumes can be used by different pods at the same time (sometimes running on the same node, sometimes on different nodes).

Object storage is not a first-class citizen in Kubernetes (see COSI for recent developments, though our cloud provider doesn't support that), but it is almost always used in conjunction with it.

vlerkin commented 4 months ago

Not sure if I understand your first question. If you follow the link https://stackoverflow.com/questions/63479814/how-to-store-my-pod-logs-in-a-persistent-storage, the volume solution described there mounts a volume into each container, where the container can direct its logs; so it's not that many containers access the same mounted volume, every container has its own mounted volume to access.

wvengen commented 4 months ago

every container has a mounted volume to access

Thanks for making this clear! It's what I would expect from the PV implementation. So we would need to dimension the PV so that the largest log that could ever occur fits. Either the PVC is deleted when the spider job is deleted, or it is kept (so logs can stay around longer) and a separate cleanup process is needed. When the job is running, one can exec and access the logfile. How to read the log when the job is finished?

object storage

With this approach, ideally, there would be a log file per spider run stored on S3, streamed while the spider runs. Unfortunately, you cannot append to an existing file on S3. But you can use multipart upload to upload logs in batches (see e.g. this SO question).
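A rough sketch of the multipart-upload idea with boto3 (bucket and key names are made up, and real code would need to buffer parts of at least 5 MiB and handle errors):

import boto3

s3 = boto3.client("s3")
bucket, key = "spider-logs", "example/quotes/job-1234.log"  # hypothetical names

# Start a multipart upload; each part except the last must be at least 5 MiB.
upload = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts = []

def upload_chunk(chunk: bytes) -> None:
    """Upload one accumulated batch of log data as the next part."""
    part_number = len(parts) + 1
    resp = s3.upload_part(Bucket=bucket, Key=key, UploadId=upload["UploadId"],
                          PartNumber=part_number, Body=chunk)
    parts.append({"ETag": resp["ETag"], "PartNumber": part_number})

# ... call upload_chunk() whenever enough log data has accumulated ...

# When the spider finishes, close the multipart upload so the parts
# become a single object.
s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=upload["UploadId"],
                             MultipartUpload={"Parts": parts})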

scrapy-logexport can do something like this, though not streaming.

vlerkin commented 4 months ago

We can't just delete the PVC because it's coupled with the PV, so I would suggest using logrotate to manage the log files. This way we regularly manage the files; it would then be nice to store them somewhere in the cloud as a final destination, so we need a tool to put them there.

Therefore, implementing log rotation to spread the log data over several files and to remove older items is a must. It involves renaming log files on a predefined schedule or when the file reaches a predefined size. Once the specified condition is met, the log file is renamed to preserve its contents and make way for a new file. Typically an auto incrementing number or timestamp is appended to the filename to indicate its time of rotation which is often helpful in narrowing down your search when investigating an issue that occurred on a specific date. After the file is renamed, a new log file with the same name is created to capture the latest entries from the application or service. A cleanup process is also initiated to prevent an accumulation of rotated log files as older logs beyond a specified retention period are removed. This process repeats indefinitely as long as the log rotation mechanism is working.

Here there is a nice tutorial and some additional info about Logrotate tool: https://betterstack.com/community/guides/logging/how-to-manage-log-files-with-logrotate-on-ubuntu-20-04/

wvengen commented 4 months ago

Well, as you mentioned before, a PVC is linked to a single pod. So when a job is finished, there is just one log file of that one spider run (or perhaps multiple attempts in case of an error). So we can just drop the PV when we don't need it anymore - no logrotate comes into play here.

Note that when you make a PVC, a PV is dynamically provisioned.
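For illustration, a per-job PVC that gets a dynamically provisioned PV could be created from scrapyd-k8s with the Kubernetes Python client roughly like this (the name, size and storage class below are assumptions, not current scrapyd-k8s behaviour):

from kubernetes import client, config

config.load_incluster_config()  # or load_kube_config() outside the cluster
core = client.CoreV1Api()

# Hypothetical per-job claim; the matching PV is provisioned dynamically
# by the cluster's storage class.
pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="logs-job-1234"),   # made-up name
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],                    # one pod writes to it
        resources=client.V1ResourceRequirements(
            requests={"storage": "1Gi"}                    # must fit the largest log
        ),
        # storage_class_name="standard",                   # depends on the cluster
    ),
)
core.create_namespaced_persistent_volume_claim(namespace="default", body=pvc)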

(also, note that Kubernetes has support for e.g. NFS volumes, which would enable sharing across pods - but I'm not sure our cloud provider supports that - there also seems to be support for this with CSI, e.g. with ceph CSI - so it is possible to use PVCs with multiple attachments, see access modes, but it really depends on the driver whether this is supported)

I think what is possible with CSI, really depends on the cloud environment. And we'd like to not use too specific features for now (like PVCs shared across pods - but maybe that is available everywhere), for ease of migration. I would think that object storage is more portable in that sense.

For a system design, I see three different options.

  1. Storing logs on a persistent volume.

  2. Storing logs on object storage, 'streaming'. Perhaps this could be more cleanly implemented as a sidecar container (with logs directed to a file), which does the uploading; that would be a more general solution for streaming logs to object storage, not only for Scrapy (see the sketch after this list).

  3. Storing logs on object storage afterwards.
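As an illustration of the sidecar idea in option 2, both containers of the job pod could share an emptyDir volume: the spider writes its log file there and the sidecar ships it. Image names, paths and the spider invocation below are made up:

from kubernetes import client

# Sketch of a job pod spec with a log-shipping sidecar.
pod_spec = client.V1PodSpec(
    restart_policy="Never",
    volumes=[client.V1Volume(name="logs",
                             empty_dir=client.V1EmptyDirVolumeSource())],
    containers=[
        client.V1Container(
            name="spider",
            image="ghcr.io/q-m/scrapyd-k8s-spider-example:latest",
            # Hypothetical invocation: direct the Scrapy log to the shared volume.
            args=["scrapy", "crawl", "quotes", "-s", "LOG_FILE=/logs/job.log"],
            volume_mounts=[client.V1VolumeMount(name="logs", mount_path="/logs")],
        ),
        client.V1Container(
            name="log-shipper",
            image="example/log-shipper:latest",   # hypothetical uploader image
            volume_mounts=[client.V1VolumeMount(name="logs", mount_path="/logs")],
        ),
    ],
)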

vlerkin commented 4 months ago

Hi Willem, could you please tell me: what is the usage pattern of scrapyd in production? How many spiders are in the cluster? Is there a designated location where the log files are stored?

wvengen commented 4 months ago

What is the usage pattern of scrapyd in production? How many spiders are in the cluster?

After migrating, this will be the case:

Is there a designated location where the log files are stored?

Not yet, but I expect there to be a single bucket for each instance. Scrapyd stores log files prefixed with the spider name and named (job_id).log; it makes sense to follow this. It may be nice to make the log file destination configurable, e.g. with some standard variables like the scrapyd node name, spider name, job id, and date components.
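A sketch of what such a configurable destination could look like (the template and variable names here are made up, not an existing scrapyd-k8s setting):

from datetime import datetime, timezone

# Hypothetical destination template with some standard variables.
LOG_DESTINATION = "s3://{bucket}/{node_name}/{spider}/{year}/{month}/{job_id}.log"

def build_log_destination(bucket, node_name, spider, job_id):
    """Expand the destination template for one spider run."""
    now = datetime.now(timezone.utc)
    return LOG_DESTINATION.format(bucket=bucket, node_name=node_name, spider=spider,
                                  job_id=job_id, year=now.strftime("%Y"),
                                  month=now.strftime("%m"))

# build_log_destination("logs", "scrapyd-k8s-1", "quotes", "ab12")
#   -> "s3://logs/scrapyd-k8s-1/quotes/2024/07/ab12.log"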

wvengen commented 4 months ago

Note that we're making a generic software component here that could be used by various people, so try to make it useful for general use-cases you can think of (without making it too complex, and making sure the above fits in).

vlerkin commented 4 months ago

While working on this, I encountered the following problem:

nobody@scrapyd-k8s-7f59f447d4-tjhjj:/opt/app$ skopeo inspect docker://ghcr.io/q-m/scrapyd-k8s-spider-example:latest
FATA[0000] Error parsing manifest for image: choosing image instance: no image found in image index for architecture arm64, variant "v8", OS linux

Willem, would it be difficult to add an image for this platform, please, so I could test and run things on k8s locally? I don't have access to the Docker repo, otherwise I could also add an image for this architecture myself.

wvengen commented 4 months ago

Ah, good point, an arm64 image would be welcome! I've added it. Does it work now?

vlerkin commented 4 months ago

Nope, I guess it is important to have v8, so the target platform that is expected is linux/arm64/v8.

When I run a pod with arm64 image I get the error:

scrapyd-k8s-spider-example:latest2n:/opt/app$ skopeo inspect docker://ghcr.io/q-m/scrapyd-k8s-spider-example:latest
FATA[0000] Error parsing manifest for image: choosing image instance: no image found in image index for architecture arm64, variant "v8", OS linux

wvengen commented 4 months ago

Ah, bummer. From what I could find, linux/arm64 is actually the same as linux/arm64/v8. Could it be that skopeo is not smart enough about that?

skopeo inspect docker://ghcr.io/q-m/scrapyd-k8s-spider-example:main
{
    "Name": "ghcr.io/q-m/scrapyd-k8s-spider-example",
    "Digest": "sha256:f6cad55a6e221c6f3a5f678c2b69af2d4a92fd5af5bff40a3b541ee2d1e457ce",
    "RepoTags": [
        "main",
        "sha-423aea2",
        "0.1.0",
        "latest",
        "sha-df90e1a",
        "sha-31321f4",
        "0.2.0",
        "0.3.0",
        "sha-37bb185",
        "sha-8c70346",
        "sha-0ab3377"
    ],
    "Created": "2024-05-14T06:16:52.618136197Z",
    "DockerVersion": "",
    "Labels": {
        "org.opencontainers.image.created": "2024-05-14T06:16:28.492Z",
        "org.opencontainers.image.description": "Example spider for scrapyd-k8s",
        "org.opencontainers.image.licenses": "MIT",
        "org.opencontainers.image.revision": "0ab33774f0daf19adfe4678497fcaf5b5ec563ba",
        "org.opencontainers.image.source": "https://github.com/q-m/scrapyd-k8s-spider-example",
        "org.opencontainers.image.title": "scrapyd-k8s-spider-example",
        "org.opencontainers.image.url": "https://github.com/q-m/scrapyd-k8s-spider-example",
        "org.opencontainers.image.version": "main",
        "org.scrapy.project": "example",
        "org.scrapy.spiders": "quotes,static"
    },
    "Architecture": "amd64",
    "Os": "linux",
    "Layers": [
        "sha256:b0a0cf830b12453b7e15359a804215a7bcccd3788e2bcecff2a03af64bbd4df7",
        "sha256:72914424168c8ebb0dbb3d0e08eb1d3b5b2a64cc51745bd65caf29c335b31dc7",
        "sha256:d12a047f1c7ea4f8e51322323d769a331c46fbf44d958ccce3359a6bd932f3d1",
        "sha256:ab33f1a2f6621fe008081cae7c61b1ac6343eecd65c7de7345642cc73d4e18eb",
        "sha256:94510a1366bcf144be79366b3176ae5866b43db71c3d3818e3d56e2e19df964b",
        "sha256:7bea44fb428257ced08eaf1d1ac6dc1c32f5a1fc4f211157f62a6724f5b39ea5",
        "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1",
        "sha256:5e0747a411d810d8f7e2c6b3faa2696635600f1c002cac89484bb392182e1de8",
        "sha256:d060d73045d94f1f61578eb8bfaf1c1495463ce8975fdf458a5f28652b730089"
    ],
    "Env": [
        "PATH=/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
        "LANG=C.UTF-8",
        "GPG_KEY=A035C8C19219BA821ECEA86B64E628F8D684696D",
        "PYTHON_VERSION=3.11.9",
        "PYTHON_PIP_VERSION=24.0",
        "PYTHON_SETUPTOOLS_VERSION=65.5.1",
        "PYTHON_GET_PIP_URL=https://github.com/pypa/get-pip/raw/dbf0c85f76fb6e1ab42aa672ffca6f0a675d9ee4/public/get-pip.py",
        "PYTHON_GET_PIP_SHA256=dfe9fd5c28dc98b5ac17979a953ea550cec37ae1b47a5116007395bfacff2ab9",
        "PYTHONPATH=/usr/local/lib/scrapy-spider.egg",
        "SCRAPY_SETTINGS_MODULE=example.settings"
    ]
}

https://github.com/containers/skopeo/issues/1617#issuecomment-1462441179 inspired me to try

docker buildx imagetools inspect ghcr.io/q-m/scrapyd-k8s-spider-example:main
Name:      ghcr.io/q-m/scrapyd-k8s-spider-example:main
MediaType: application/vnd.oci.image.index.v1+json
Digest:    sha256:f6cad55a6e221c6f3a5f678c2b69af2d4a92fd5af5bff40a3b541ee2d1e457ce

Manifests: 
  Name:        ghcr.io/q-m/scrapyd-k8s-spider-example:main@sha256:6c3209673a8a24fcebedb83c9f057601c40fa6ce3345f30b7bbaadacda3875b8
  MediaType:   application/vnd.oci.image.manifest.v1+json
  Platform:    linux/amd64

  Name:        ghcr.io/q-m/scrapyd-k8s-spider-example:main@sha256:2e309079bda896cfeb40302104458ece0839a7d6a01f58db86e67eb78b3f9b57
  MediaType:   application/vnd.oci.image.manifest.v1+json
  Platform:    linux/arm64

  Name:        ghcr.io/q-m/scrapyd-k8s-spider-example:main@sha256:e9ddec335510ba5ee6720808e49822f0f9636bd100142301cc3ca9eddea843e0
  MediaType:   application/vnd.oci.image.manifest.v1+json
  Platform:    unknown/unknown
  Annotations: 
    vnd.docker.reference.digest: sha256:6c3209673a8a24fcebedb83c9f057601c40fa6ce3345f30b7bbaadacda3875b8
    vnd.docker.reference.type:   attestation-manifest

  Name:        ghcr.io/q-m/scrapyd-k8s-spider-example:main@sha256:c2043600be8fac125aa00ac83ab512b7affaaed4ed9b63fa4b7f663773bf096f
  MediaType:   application/vnd.oci.image.manifest.v1+json
  Platform:    unknown/unknown
  Annotations: 
    vnd.docker.reference.digest: sha256:2e309079bda896cfeb40302104458ece0839a7d6a01f58db86e67eb78b3f9b57
    vnd.docker.reference.type:   attestation-manifest

which does show arm64 too. What happens if you use the main tag instead?

vlerkin commented 4 months ago

Yes, thank you, main does work!

vlerkin commented 3 months ago

The solutions I have been working on that comply with the criteria "simple, universal, on Kubernetes, no spider modification":

  1. Add a sidecar container to the Kubernetes configuration in the schedule function. The sidecar container and the spider container share a volume; the spider container redirects its logs from stdout/stderr to the volume, and the sidecar container collects them and streams them to Elastic or similar storage that is compatible with streaming. Because the volume is not persistent and the pod with the job runs to completion, there is no default window to ship logs, so streaming is a better fit with the given configuration.

  2. A modification on top of the previous solution is to use a persistent volume. This way we are not forced to ship logs immediately; a persistent volume and its implementation may seem a bit more complex than the previous setting, but it's still a good improvement. It is also possible to run a sidecar on the scrapyd-k8s pod and access a persistent volume shared with the spider pods (a regular volume cannot be shared across multiple pods, but a persistent one can) - but I got the impression that you don't really like the persistent volume solution.

  3. To avoid the problem of the spider pod not being persistent, a sidecar container can be configured in the scrapyd-k8s pod, the one that manages the whole party on the cluster. Because all pods share the same namespace, we can try to access other pods' logs from that sidecar and collect them; however, this requires more engineering compared to the previous solutions, and there is already an existing solution for this situation - Fluent Bit [https://fluentbit.io/how-it-works/]. It is a DaemonSet, so k8s deploys a pod with it on each node; this is convenient because it scales automatically when you add new nodes to the cluster, so the coverage of log collection is very good. It can be configured to listen to logs or to watch specific log files. Fluent Bit also adds metadata like the pod name to the logs, so it's easier to identify or analyse them, and it can ship logs to many destinations. Fluent Bit requires a ConfigMap object to configure the details. It does sound like overkill for our small cluster, but in terms of engineering work it is very resilient to failures, works out of the box, and is an open-source industry standard.

vlerkin commented 3 months ago

@wvengen you asked me to look into some sort of webhooks that detect resource changes; is this something you had in mind? https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes
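That page describes the watch mechanism that the Python client also exposes. A minimal sketch of watching job pods for changes (the namespace and label selector are assumptions about how the job pods are labelled):

from kubernetes import client, config, watch

config.load_incluster_config()
core = client.CoreV1Api()

w = watch.Watch()
# Stream ADDED / MODIFIED / DELETED events for job pods instead of polling.
for event in w.stream(core.list_namespaced_pod, namespace="default",
                      label_selector="org.scrapy.job_id"):
    pod = event["object"]
    print(event["type"], pod.metadata.name, pod.status.phase)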

vlerkin commented 3 months ago

When the pod with the spider completes the run, its logs are still available; I just need to find a way to collect the logs from code when the pod is done.

wvengen commented 3 months ago

I'll look back at what you've written next week (time permitting, sorry I'm a bit busy these days). Will comment on some small things, and am curious what you would still recommend here.

When the pod with the spider completes the run, its logs are still available; I just need to find a way to collect the logs from code when the pod is done.

That is almost true, but not quite: logs are truncated each night (I think), and the spider can run for longer than a day, so it still needs to do this periodically.

Elastic

A new service for parsing and storing logs, running separately from the spider jobs, is not really what I had in mind for this issue. If we would go the standard-k8s-logging-stack route, one benefit would be that it could integrate well with clusters already having this. A downside is that it requires cluster resources running all the time, increasing costs (in our case, where we have no full logging stack present already).

persistent volume

I think it would be ok to use a persistent volume (not preferred, but perhaps a simple solution). That would be one persistent volume per spider job. There are some size considerations, i.e. if jobs are not deleted automatically and persistent volumes remain, that would add up quickly. And we'd still need to migrate from the persistent volume to object storage.

regarding sidecar container

You can basically run multiple containers in a single pod (as you mention), and indeed when one container finishes, it affects the others. By using a custom entry-point (either need to take care to chain them properly, if there is an existing entry-point, or else perhaps require that the container image doesn't use a custom entry-point), it may be possible to run the spider and then wait for log shipping to finish (or adapt scrapyd-k8s with a custom spider run command to do so, not requiring adapting the entry-point).

regarding detection of changes

Great that you found that. Note that in #6, we may want to start listening to changes to spider pods. So if we need to work with listening to changes for log handling, scrapyd-k8s might be a suitable place for some parts of this.

vlerkin commented 3 months ago

The Python library for k8s that is used to set up containers and jobs can retrieve logs via the core API, which is already configured in the code: def read_namespaced_pod_log(self, name, namespace, **kwargs)

The question is how to use this correctly and not lose logs that are truncated at night; I need to think about possible approaches. Another "difficulty" is managing multiple containers with jobs, but it seems we can list all of them and then work with that.
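For example, listing the job pods and saving whatever log content the kubelet currently has could look roughly like this (the namespace, label selector and file paths are assumptions):

from kubernetes import client, config

config.load_incluster_config()
core = client.CoreV1Api()

# Find all job pods and fetch the log content still available for each one.
pods = core.list_namespaced_pod(namespace="default",
                                label_selector="org.scrapy.job_id")
for pod in pods.items:
    log_text = core.read_namespaced_pod_log(name=pod.metadata.name,
                                            namespace="default")
    with open(f"/tmp/{pod.metadata.name}.log", "w") as f:
        f.write(log_text)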

vlerkin commented 3 months ago

After considering different solutions, we decided to implement persistent volumes. It is important to keep in mind that the volume shouldn't be too big, and that we need to ship logs elsewhere to free up space if we want to store them longer - or just clean the volume every once in a while when it's almost full, removing the oldest log files first.

vlerkin commented 3 months ago

In a production environment a persistent volume is implemented by configuring external storage; it can be a hard drive, S3, or other options. @wvengen are you ok with implementing the persistent volume using S3?

wvengen commented 3 months ago

S3 seems the sensible option for long-term storage, yes. If k8s persistent volumes can use S3 (and are supported in general, i.e. not only for a specific cloud provider), then that would be great.

vlerkin commented 3 months ago

From what I learned, it can support S3 from different providers. The only thing I am not sure about is whether we can unify the configuration of different S3 providers; for this I need to find out how to configure them, highlight the overlapping parts, and understand whether it's possible to pass credentials as env variables or in some other way.

vlerkin commented 3 months ago

Hi @wvengen, I was reading the Logging Architecture page in the Kubernetes docs, and it made me wonder: how much space do we have now for log files? How much do we need at maximum?

You can configure two kubelet configuration settings, containerLogMaxSize (default 10Mi) and containerLogMaxFiles (default 5), using the kubelet configuration file. These settings let you configure the maximum size for each log file and the maximum number of files allowed for each container respectively.

If we can just expand the limits on log file sizes, that would be the easiest way. Kubernetes is responsible for log rotation when we are talking about stdout/stderr; my assumption is that it wipes logs when they reach the default limits.

It does not preserve logs, but it is another angle to tackle the problem if we need to let logs live a bit longer than the jobs.

Note: Only the contents of the latest log file are available through kubectl logs. For example, if a Pod writes 40 MiB of logs and the kubelet rotates logs after 10 MiB, running kubectl logs returns at most 10MiB of data.

wvengen commented 3 months ago

Ah, that is very interesting, thank you! It is indeed tangential to the issue, but may actually satisfy our direct need. Thank you for sharing this! We are currently running on Kubernetes v1.26, would that support the beta kubelet config? Is it possible to change the logging settings only for scrapyd-k8s spider pods / jobs?

vlerkin commented 3 months ago

These parameters are available starting from v1.21, so yes, they are available for us with v1.26. It is possible to set these options for containers via the kubelet configuration, but there are no settings to enlarge the logging space for selected containers only; if we apply the settings, then all containers on the cluster will consume more resources for logging, unfortunately.

vlerkin commented 3 months ago

Concerning the other solution.

If you want S3 specifically, then we don't need a persistent volume, but there is a challenge in collecting logs from all pods (which Fluent Bit solves, but we are moving forward with something custom). I am looking into one possible solution for that, but I'm not sure if it's a good one. To write anything to S3 we would need to aggregate logs using Logstash and find a way to collect logs and send them to Logstash. My current idea on that is to use a TCP socket and try to configure the worker pods to send their logs via it. Not very easy.

Persistent volumes natively support different storage backends. If we have multiple pods and we want to collect logs in the same storage, we need to choose a provider that supports ReadWriteMany mode, so many pods can write to the same storage at the same time; this way I can just redirect stdout/stderr to files stored on the persistent volume. An example of such storage is Google Cloud Filestore; note that k8s natively supports different file storage providers, which are listed here: https://kubernetes.io/docs/concepts/storage/persistent-volumes/#types-of-persistent-volumes

Both solutions have certain limitations in the infrastructure we can use, so it's always a trade-off if we don't want to use industry-standard approaches for all clusters.

wvengen commented 3 months ago

What about asking Scrapy to log to a file on a persistent volume? There is a standard way to do this, when invoking the spider. Then at the end of the spider run, move the file to object storage. Or, if we can have a persistent volume on object storage, that would be great - but I think that has some drawbacks / corner cases to think about. Regarding persistent volumes and EBS/Azure/GCE, these are deprecated: if there are CSI implementations doing this, that could be useful.
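The standard way referred to would be Scrapy's LOG_FILE setting (or the --logfile option). A minimal sketch of that approach, with the log written to a mounted volume and copied to S3 when the spider is done (paths, bucket and spider name are made up):

import subprocess
import boto3

job_id = "1234"                      # hypothetical job id
log_path = f"/logs/{job_id}.log"     # file on the mounted persistent volume

# Scrapy's LOG_FILE setting directs the log to a file instead of stderr.
subprocess.run(["scrapy", "crawl", "quotes", "-s", f"LOG_FILE={log_path}"],
               check=False)          # ship the log even if the spider failed

# At the end of the spider run, move the file to object storage.
boto3.client("s3").upload_file(log_path, "spider-logs",
                               f"example/quotes/{job_id}.log")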

vlerkin commented 3 months ago

On our Wednesday call I suggested collecting logs from code, so "moving the file at the end of a spider run" is the same as collecting logs from code: we need data structures to keep track of running pods, and we need to invoke a script that checks the status of each pod on a schedule, e.g. every minute. But at the same time we are facing the problem of logs being truncated, so this way of solving the problem does not really solve our problem. With quite some engineering it's probably doable, but there are corner cases that make it complex.

That is why introducing persistent volumes is a good way to persist logs compared to the previous approach, I think: we ask pods to write to the persistent volume, so the logs are preserved even if a pod crashes or is deleted - we still have its logs. And if we constantly write to the file, then we don't need to come up with a solution for truncated logs; they will be preserved as well.

Concerning CSI, we can still use the old annotation for disks, and it will be automatically redirected to the new abstraction with the CSI driver. But I can also spend some time looking into how to deploy this CSI.

In Kubernetes 1.30, all operations for the in-tree awsElasticBlockStore type are redirected to the ebs.csi.aws.com CSI driver.

Are there drawbacks/corner cases with persistent volumes you see right now and want me to think about?

wvengen commented 3 months ago

That is why introducing persistent volumes is a good way to persist logs

Agree! Let's go that route.

CSI

Good to know that old annotations still work - are they 'translated' to CSI by Kubernetes? I think it is useful to look a bit into CSI, also because there are probably limitations that would be good to know about.

Are there drawbacks/corner cases with persistent volumes you see right now and want me to think about?

I do read that S3 with CSI has limitations. I think it is good to see what is possible with that, but also come up with an approach without. So if we can use S3 as PV storage (incl. updates of logs - which I think is harder), that would be great. When I think KISS, I still come to letting Scrapy store logs on a regular persistent volume, and copy it to object storage when the spider is done (e.g. with an entrypoint).

vlerkin commented 3 months ago

What is a regular persistent volume for you? In production it's always implemented through a storage backend that fits the requirements. If we use a file storage that allows ReadWriteMany, and for example pay for that, why would we need another storage like S3, which is not compatible with ReadWriteMany? This way you store the same files in two different types of storage and have to pay for both.

If you want precisely S3 and only S3 (or another object storage), which is not compatible with ReadWriteMany, then we don't need a persistent volume. We can aggregate logs with Logstash - the challenge is collecting them into Logstash (sorry, I am repeating myself a bit, I mentioned this solution above) - and Logstash can be configured to ship the aggregated logs to S3 directly, so we don't need an additional step and extra resources here to set up any sort of volume.

I will look into CSI to learn about this abstraction.

wvengen commented 3 months ago

Good question :) I would expect that dynamically provisioned network-attached-storage would be common among Kubernetes cloud providers. I think local storage may or may not work (depending on local disk space), but as you mentioned before, it is not recommended to depend on node-local storage.

I think we have three scenarios now:

  1. A full-fledged ELK stack (didn't talk much about that before, but it is an option). What would be the (resource) costs for this?
  2. A custom solution with logstash storing logs on cheap durable storage (which is S3 in our case). Would this be very involved / complicated / hackish? What would be the (resource) costs for this?
  3. A very basic solution storing Scrapy logs in a file (by configuring Scrapy), and shipping the file to S3 after the spider is finished.

vlerkin commented 2 months ago

Currently I have managed to access the logs of job pods from the managing pod with scrapyd. I created a watcher to monitor events, and when another pod with a job is running, I ask a second watcher to monitor and send its logs. The problem with the second watcher is that it does not really watch: it reads and sends the logs and then quits until the next event triggers log reading. This is not the behavior I would like to have, so I am looking into this problem now and need advice from more experienced colleagues.

In the meanwhile I made a workaround: I append logs to a file located in the scrapyd pod on every read, so the file does not get rewritten, and in case logs were truncated, I still have them in the file. Then I can delete duplicated lines by running pandas commands, and the files are ready to be shipped to S3. But this is a workaround, and not watching the logs has some potential for log loss: since reading logs is triggered by events, it is possible that between one event and the next, some logs were written and then truncated, or the pod failed and we lose some part. It's probably not a big part, and you said we can afford it.

In case I don't manage to make the log watcher work as expected, I can present that workaround solution next time. I had to put the watcher in a separate thread in Flask, and I guess there are some details about async and threading that I did not fully grasp, which is why the log watcher does not act as desired.
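For reference, a minimal sketch of the "really watching" variant: as far as I understand, the client's Watch helper can follow a pod's log as a stream, so lines arrive as they are written instead of being re-read on every event (pod name and file path are made up):

from kubernetes import client, config, watch

config.load_incluster_config()
core = client.CoreV1Api()

w = watch.Watch()
# Follow the pod's log; each iteration yields one new log line.
with open("/data/logs/job-1234.log", "a") as out:
    for line in w.stream(core.read_namespaced_pod_log,
                         name="job-1234-abcde", namespace="default"):
        out.write(line + "\n")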

wvengen commented 2 months ago

Thank you for the progress update! I'm curious to see any configuration and code you have - but feel free to polish it a bit more if you like (though I'm a fan of early sharing).

vlerkin commented 2 months ago

Ok, for now we have log collection to files inside the scrapyd-k8s container; the logs are collected by watchers (Kubernetes watch and client). Instead of cleaning up duplicates, we want to keep only unique logs in the files, because pandas and other data science frameworks are quite big and do not really belong in our project.

To keep it clean and simple, we can track which lines have already been added to the log file. There are multiple ways to do that; I see one really easy and effective way, which I am going to implement and try on the cluster.

We have many lines like this in our job logs:

2024-07-03 09:08:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': 'The fear of death follows from the fear of life. A man who lives fully is prepared to die at any time.', 'author': 'Mark Twain', 'tags': 'death'}
2024-07-03 09:08:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': 'A lie can travel half way around the world while the truth is putting on its shoes.', 'author': 'Mark Twain', 'tags': 'misattributed-mark-twain'}

So we cannot really rely on the date-time stamp. But if we take the last two lines in the file with job logs, we have a unique combination of lines; we can then parse the job container logs and skip lines until we find a match for those two lines, and append to the log file only the lines that come after that. This is an interesting algorithm! Also, hashing long lines will take more time than comparing them symbol by symbol, so I want to leave hashing out of this algorithm.
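A sketch of that matching idea (purely illustrative, not the actual code):

def append_new_lines(log_file_path, container_lines):
    """Append only the container log lines that come after the last two
    lines already present in the local log file."""
    try:
        with open(log_file_path, "r") as f:
            existing = f.read().splitlines()
    except FileNotFoundError:
        existing = []

    if len(existing) >= 2:
        tail = existing[-2:]
        start = 0
        # Find the last occurrence of the two-line tail in the fresh logs;
        # everything after it is new.
        for i in range(len(container_lines) - 1):
            if container_lines[i:i + 2] == tail:
                start = i + 2
        new_lines = container_lines[start:]
    else:
        new_lines = container_lines

    if new_lines:
        with open(log_file_path, "a") as f:
            f.write("\n".join(new_lines) + "\n")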

Future plans we discussed:

Still keeping in mind other possible solutions, but working on this one because it does not require heavy frameworks and is simple enough for small clusters.