mediacloud / web-search

Code that drives the public web-based tools for the Media Cloud Online News Archive and Directory.
https://search.mediacloud.org
Apache License 2.0

Logging middleware #852

Open pgulley opened 1 week ago

pgulley commented 1 week ago

This implements a toggleable Request Log. When enabled, requests are logged to /app/request_log.log- although they also hit the base logstream. One issue is that the logs are only created per-worker, so actually exporting the logfile requires attaching to the web worker and pulling it down- it doesn't seem that there's an easy way to just get a file from a specific worker directly. I've added a script into dokku-scripts that will pull these logs down into an individual file- not sure it's the best option writ large but it will do!

I think the real professional option would be to utilize some third-party application to aggregate logs- dokku has support for a couple of different approaches here- but setting something like that up seems beyond the scope of this problem.

pgulley commented 1 week ago

I suppose the other outstanding potential problem with logging to a file is that we have no way to enforce a retention policy. For certain kinds of aggregate analysis I could imagine wanting logs going back a month or more, so idk how to approach this.

philbudne commented 1 week ago

General comments:

(I started composing this once already in the GitHub UI, but a wayward click caused the screen/text to vanish, never to be found again!) So I'm trying again from the comfort of emacs in a shell window!

The name request_log.log seems redundant to me!

Logging in web-search is ... underdeveloped, and needs general attention (see below). And while I understand we might want to regard this as a special case, where you hit the switch and look at a special-purpose log file rather than having to paw thru a general log file...

It may also be that this is SUCH a special case that logging to a file isn't even the right choice, and using redis for pub/sub distribution would be better, and would allow immediate display on a web screen.

I'm trying my best to say this without being pejorative: I want to reiterate, my reaction to creating a single-purpose table which will/can only have one row is visceral and negative. I think it can/will easily get out of hand, and that a general-purpose "properties" table is preferable.

Simply stashing the switch in redis (without a backing table) might make sense (since we'd already be checking redis for cached data). The fact that it would be reset on reboot/restart might even be a feature. It's hard to imagine this is the first time someone has needed such a thing in the django universe!!
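For what it's worth, a minimal sketch of what stashing the switch in redis could look like with redis-py (the key name and connection details are assumptions):

import redis

r = redis.Redis(host='localhost', port=6379, db=0)

FLAG_KEY = 'mcweb:request-logging-enabled'  # hypothetical key name

def request_logging_enabled() -> bool:
    # A missing key reads as "off", so a redis flush/restart resets the switch.
    return r.get(FLAG_KEY) == b'1'

def set_request_logging(enabled: bool) -> None:
    r.set(FLAG_KEY, b'1' if enabled else b'0')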

Regarding log files (if they're still on the table) I've described how I approached this in rss-fetcher and story indexer in a new web-search issue https://github.com/mediacloud/web-search/issues/853

philbudne commented 1 week ago

And if redis pub/sub is too complicated (and I suspect it is!), a simpler alternative would be to keep a queue in redis, and after appending, prune the oldest entries to keep the queue length within a limit...
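For illustration, a minimal sketch of that capped queue with redis-py (key name and cap are assumptions): LPUSH prepends each entry, LTRIM throws away everything past the limit, and LRANGE can read the members back without popping them.

import json
import redis

r = redis.Redis()
LOG_KEY = 'mcweb:request-log'  # hypothetical key name
MAX_ENTRIES = 1000             # assumed cap

def append_log_entry(entry: dict) -> None:
    # Push the newest entry to the head, then prune the tail.
    r.lpush(LOG_KEY, json.dumps(entry))
    r.ltrim(LOG_KEY, 0, MAX_ENTRIES - 1)

def recent_log_entries(n: int = 100) -> list:
    # Read (without consuming) the n newest entries.
    return [json.loads(raw) for raw in r.lrange(LOG_KEY, 0, n - 1)]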

philbudne commented 1 week ago

Googling "django logging to redis queue" gave me this AI-generated response(!!)

To achieve Django logging to a Redis queue, you can combine Django's logging capabilities with a Redis queue library like django-rq. Here's a step-by-step guide:

Step 1: Install the Necessary Libraries

pip install django-rq redis

Step 2: Configure Django Settings

Add django-rq to your installed apps: 
   INSTALLED_APPS = [
       # ... other apps
       'django_rq',
   ]
Configure Redis connection for django-rq: 

   RQ_QUEUES = {
       'default': {
           'HOST': 'localhost',
           'PORT': 6379,
           'DB': 0, 
           'DEFAULT_TIMEOUT': 360, 
       }
   }

Step 3: Create a Custom Logging Handler


import logging
import json
from django_rq import get_queue

class RedisQueueHandler(logging.Handler):
    def __init__(self, queue_name='default'):
        super().__init__()
        self.queue = get_queue(queue_name)

    def emit(self, record):
        log_entry = {
            'level': record.levelname,
            'message': record.getMessage(),
            'timestamp': record.created,
            # Add any other relevant fields here
        }
        # NOTE: enqueue() takes a callable (or its dotted path) first;
        # pass the worker job from Step 5, with the entry as its argument.
        self.queue.enqueue('path.to.your.process_log', json.dumps(log_entry))

Step 4: Configure Django Logging


LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'handlers': {
        'redis': {
            'level': 'DEBUG',  # Adjust the level as needed
            'class': 'path.to.your.RedisQueueHandler',  # Replace with the actual path
        },
    },
    'loggers': {
        'django': {
            'handlers': ['redis'],
            'level': 'DEBUG',  # Adjust the level as needed
            'propagate': True,
        },
    },
}

Step 5: Create a Worker to Process Logs

import json
from django_rq import job

@job
def process_log(log_entry):
    log_data = json.loads(log_entry)
    # Process the log data as needed, e.g., save to a database, send an email, etc.
    print(log_data)

Step 6: Start the RQ Worker

python manage.py rqworker default

Now, your Django logs will be sent to the Redis queue, and the worker will process them as they arrive. Generative AI is experimental.

pgulley commented 1 week ago

I hear you on the config.ini thing- there will certainly be more cases in the future where we want dynamically configurable flags/behaviors set up- and maybe even some features already in use where we would rather use something with some runtime configurability (thinking like a system message, or available date ranges, etc)- so I've gone and spun up a first pass at that and hooked it into the logging middleware. One catch is that the property values are just text- some additional validation on the admin side and the retrieval side would smooth things out a bit.

A redis handler sounds fantastic, thanks for looking up some easy configuration steps there- I guess the destination is the piece we don't really have a great answer for at this moment.

philbudne commented 1 week ago

... and maybe even some features already in use where we would rather use something with some runtime configurability (thinking like a system message, or available date ranges, etc)

So true!!!

A redis handler sounds fantastic, thanks for looking up some easy configuration steps there- I guess the destination is the piece we don't really have a great answer for at this moment.

Well, the configuration was from an AI generated report, so buyer beware! Actually a redis queue isn't quite right; we want an array where you can see all the members, and keep it pruned to a fixed length. Redis has "lists" but I think they're actually "deques" (double-ended queues); you can only push or pop from either end, and they're probably the primitive used for queues.

pgulley commented 1 week ago

Ok, I have the config properties table working as intended now, with the logging_middleware still exporting to a file. I personally feel like the question of where we may want to stream these logs is best left for another PR after we've discussed the options- is logging to files an acceptable stopgap while we figure that out?

philbudne commented 1 week ago

config.ini files have sections. I think having multiple namespaces is more important than a lovely UI, but that's just me! This is the (sqlalchemy) Property model from rss-fetcher:

# imports for context (in rss-fetcher, Base is the shared declarative base):
from sqlalchemy import Column, String

class Property(Base):
    __tablename__ = 'properties'

    # see property.py for section, key values:
    section = Column(String, primary_key=True, nullable=False)
    key = Column(String, primary_key=True, nullable=False)
    value = Column(String)

Tho too many sections with only a few variables could be just as bad as more variables than fit on a single screen.

One thought on how to section things would be to group them by who would/can access them.

However, another thought has come to mind for controlling whether the messages get logged to disk (keep your eye out for "JSON" below)

Ok, I have the config properties table working as intended now, with the logging_middleware still exporting to a file. I personally feel like the question of where we may want to stream these logs is best left for another PR after we've discussed the options- is logging to files an acceptable stopgap while we figure that out?

Looking at logging.FileHandler it appears to be multi-process safe (there are 1092 mcweb gunicorn processes running) and multi-thread safe (there are 36932 mcweb gunicorn threads running):

  1. It flushes (does a system write call) after every message logged, and does the flush under a threading lock(!)

  2. It opens files for append, which SHOULD translate to O_APPEND on the system call to open the file (meaning all system write calls should get appended to the file atomically)

That means the files inside each container shouldn't be garbled.

BUT:

  1. The files are scattered in different containers, in container private storage that is likely pruned not long after the container exits. Watching the log in real time with tail is impractical. I feel a script to grab them out of the containers is a crock.

  2. The files will grow without bound not if, but WHEN someone leaves the switch on.

The simple way out is using a (Timed)RotatingFileHandler, but they're not multi-process safe.

My preference here would be to use SysLogHandler and send all the messages to the syslog-sink.py program that story-indexer uses (only we'd need it (and the App class it depends on) to be in a shared repo (mc-common? which could also contain statsd wrappers (send request timings to grafana) for use across our software stacks)).

We could start with just a "requests_handler" using SysLogHandler, and later create more that set different facility code arguments to have different things (background tasks) sent to different files.
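For concreteness, a sketch of what a requests handler wired to SysLogHandler might look like in Django's LOGGING dict (the sink address, facility, and logger name are assumptions, not the actual web-search config):

import logging.handlers

LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'handlers': {
        'requests_syslog': {
            'class': 'logging.handlers.SysLogHandler',
            'address': ('syslog-sink', 514),  # assumed host/port of syslog-sink.py
            'facility': logging.handlers.SysLogHandler.LOG_LOCAL0,
        },
    },
    'loggers': {
        # hypothetical logger used only by the request-logging middleware:
        'mcweb.request_log': {
            'handlers': ['requests_syslog'],
            'level': 'INFO',
            'propagate': False,
        },
    },
}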

An alternative to using/enhancing the existing syslog-sink.py would be to rewrite it as a django manage.py subcommand, in which case the config for it could be in settings.py, which initially looks cute, but seems like it could become confusing (each file requires a handler (and at least one logger) that NO other code should ever use!)

I think keeping the log sink separate from Django, and putting the config in a JSON file (on the non-volatile data volume where the log file(s) are written) that's checked periodically for changes would be a better choice; it would allow changes on the fly and the improvement would benefit story-indexer and rss-fetcher.
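A minimal sketch of that periodically-checked JSON config (the path is an assumption): re-read the file only when its mtime changes, so edits take effect on the fly without a restart.

import json
import os

CONFIG_PATH = '/app/data/log-sink.json'  # hypothetical location on the data volume

_config: dict = {}
_config_mtime = 0.0

def current_config() -> dict:
    global _config, _config_mtime
    mtime = os.stat(CONFIG_PATH).st_mtime
    if mtime != _config_mtime:  # file changed (or first call): reload it
        with open(CONFIG_PATH) as f:
            _config = json.load(f)
        _config_mtime = mtime
    return _config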

mcweb isn't rolled out often (usually no more than once a week), and since there are multiple developers, the main branch will often contain untested things, so my feeling is that going out half baked is a poor choice.

Welcome to the world of "there isn't a 15 minute change that can't take (at least) a week to get into production"

P.S. And I get queasy each time I see a file named thing_log.log; log files are usually named thing.log. But I'm also annoyed by my brother calling pesto "pesto sauce"; pesto IS a sauce!

philbudne commented 1 week ago

P.P.S. I was wrong, the story-indexer syslog-sink is NOT an App subclass (would have required special hackery to AVOID having its own log messages sent to itself!)

philbudne commented 1 week ago

P.P.S. I'd be happy to do the syslog-sink work.

pgulley commented 1 week ago

Thanks for your thoughts! I would love to hand that syslog sink task over to you @philbudne - it sounds like there's lots of project-wide improvements to be had there.

pgulley commented 3 days ago

Okay, I've added sections now. I'll circle back to this later in the week.

pgulley commented 3 days ago

Thanks for the syslog-sink work, Phil- with that merged I think that using it in place of the filelogger is the next step in the order of operations here, and then we can merge and start thinking about the ratelimiter. I'll jump on that this afternoon.

pgulley commented 2 days ago

This now seems to be working with the syslog integration as intended! Ready to merge? @philbudne

philbudne commented 2 days ago

I can't stand trying to write things in the github UI, I inevitably sneeze, and an hour of typing is lost forever (a comment someone once made about emacs when I was young!).

My two most germane thoughts:

Looking at the access.log from pgulley-mcweb, the last five lines weigh in at 23998 characters, so almost 4800 characters per entry.

That's a LOT of logging!!

I'd argue that it's beyond what's readable to a human, and if it's all valuable, maybe the file should be written as JSON, so it's easily machine parseable.

Starting from basics, there are two pieces of information that aren't in the web server log: The user, and the POST content, neither of which are large (and, OK a third: perhaps the original IP address, since it might be hidden in an HTTP header... and OK, at that point, just save ALL the HTTP headers? Nobody expects the Spanish Inquisition! https://www.youtube.com/watch?v=QqreRufrkxM).

Here is the last line from the nginx log for pgulley-mcweb (in a standard, if unpleasant machine readable format):

10.1.1.254 - - [19/Nov/2024:16:40:33 -0500] "POST /api/search/total-count HTTP/1.1" 200 47 "http://pgulley-mcweb.ifill.angwin:8080/search?q=test&nq=&start=10-16-2024&end=11-19-2024&p=onlinenews-mediacloud&ss=&cs=34412234&any=any&name=test&edit=false" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:132.0) Gecko/20100101 Firefox/132.0"

And the last line from your access.log (possibly the same query, I can't tell):

2024-11-19 21:40:33,500 df1c265c4f5e INFO: 2024-11-19 21:40:33.499985+00:00 - Method: POST, Path: /api/search/total-count, User: @.***, Duration: 0.0486 s, IP: 172.17.0.1, Params: ParsedQuery(start_date=datetime.datetime(2024, 10, 16, 0, 0), end_date=datetime.datetime(2024, 11, 19, 0, 0), query_str='test', provider_props={'domains': ['nytimes.com', 'washingtonpost.com', 'csmonitor.com', 'usatoday.com', 'latimes.com', 'nypost.com', 'chicagotribune.com', 'chron.com', 'dallasnews.com', 'newsday.com', 'sfgate.com', 'bostonglobe.com', 'startribune.com', 'tampabay.com', 'ocregister.com', 'sacbee.com', 'miamiherald.com', 'kansascity.com', 'baltimoresun.com', 'mercurynews.com', 'jsonline.com', 'orlandosentinel.com', 'sun-sentinel.com', 'dispatch.com', 'post-gazette.com', 'twincities.com', 'omaha.com', 'arkansasonline.com', 'buffalonews.com', 'courant.com', 'talkingpointsmemo.com', 'dailykos.com', 'hotair.com', 'redstate.com', 'politicalwire.com', 'tnr.com', 'factcheck.org', 'realclearpolitics.com', 'reuters.com', 'foxnews.com', 'cnn.com', 'npr.org', 'newsweek.com', 'usnews.com', 'newyorker.com', 'forbes.com', 'fortune.com', 'theatlantic.com', 'motherjones.com', 'harpers.org', 'thenation.com', 'nationalreview.com', 'newsbusters.org', 'rawstory.com', 'wsj.com', 'cleveland.com', 'upi.com', 'thedailybeast.com', 'cbsnews.com', 'cnbc.com', 'salon.com', 'cnet.com', 'time.com', 'techcrunch.com', 'espn.com', 'gizmodo.com', 'rollingstone.com', 'mashable.com', 'eonline.com', 'gawker.com', 'fivethirtyeight.com', 'businessweek.com', 'jezebel.com', 'buzzfeed.com', 'politico.com', 'arstechnica.com', 'opensecrets.org', 'hollywoodreporter.com', 'weeklystandard.com', 'thewrap.com', 'businessinsider.com', 'dailycaller.com', 'variety.com', 'pjmedia.com', 'billboard.com', 'mongabay.com', 'breitbart.com', 'maxpreps.com', 'postandcourier.com', 'slate.com', 'townhall.com', 'rollcall.com', 'bizjournals.com', 'nj.com', 'cnsnews.com', 'livescience.com', 'propublica.org', 'pitchfork.com', 'vanityfair.com', 'c-span.org', 'observer.com', 'stripes.com', 'theweek.com', 'thestreet.com', 'spectator.org', 'pri.org', 'benzinga.com', 'theblaze.com', 'techradar.com', 'truth-out.org', 'kapitall.com', 'deadspin.com', 'mysanantonio.com', 'ajc.com', 'ncregister.com', 'colorlines.com', 'hollywoodlife.com', 'opposingviews.com', 'newsone.com', 'nbcsports.com', 'radaronline.com', 'usmagazine.com', 'bleacherreport.com', 'seattletimes.com', 'azcentral.com', 'stltoday.com', 'creators.com', 'eurweb.com', 'newsmax.com', 'nationalmemo.com', 'patriotpost.us', 'centralmaine.com', 'cbssports.com', 'gazettenet.com', 'rutlandherald.com', 'opednews.com', 'cincinnati.com', 'inquisitr.com', 'indiewire.com', 'investors.com', 'grist.org', 'huffingtonpost.com', 'ibabuzz.com', 'pressrepublican.com', 'pewstates.org', 'rttnews.com', 'therealnews.com', 'barrons.com', 'parade.com', 'ncronline.org', 'breakingnews.com', 'marketwatch.com', 'scientificamerican.com', 'monstersandcritics.com', 'abcnews.go.com', 'zdnet.com', 'staradvertiser.com', 'wired.com', 'bloomberg.com', 'engadget.com', 'ibtimes.com', 'sentinelsource.com', 'theconversation.com', 'syracuse.com', 'tmz.com', 'alternet.org', 'univision.com', 'newsblaze.com', 'elnuevodia.com', 'mlive.com', 'qz.com', 'ign.com', 'sbnation.com', 'airforcetimes.com', 'uproxx.com', 'providencejournal.com', 'fansided.com', 'americanfreepress.net', 'polygon.com', 'elitedaily.com', 'armytimes.com', 'navytimes.com', 'fayobserver.com', 'ew.com', 'norwichbulletin.com', 'oregonlive.com', 'philly.com', 'nhgazette.com', 
'pilotonline.com', 'marinetimes.com', 'popmatters.com', 'ledger-enquirer.com', 'fusion.net', 'newsy.com', 'indystar.com', 'accesshollywood.com', 'x17online.com', 'eater.com', 'saturdayeveningpost.com', 'bustle.com', 'hudsonvalley360.com', 'vox.com', 'contactomagazine.com', 'refinery29.com', 'dailysignal.com', 'talkingnewmedia.com', 'inquirer.com', 'theverge.com', 'mintpressnews.com', 'mic.com', 'thepostgame.com', 'recorder.com', 'axcessnews.com', 'oann.com', 'berkshireeagle.com', 'thrillist.com', 'redherring.com', 'axs.com', '247sports.com', 'augustachronicle.com', 'sportingnews.com', 'indiancountrytoday.com', 'cherokeephoenix.org', 'sandiegouniontribune.com', 'theintercept.com', 'pbs.org', 'foxsports.com', 'scout.com', 'talkmedianews.com', 'dailysource.org', 'lexisnexis.com', 'people.com', 'polamjournal.com', 'schwartzreport.net', 'vice.com', 'antena305.com', 'msnbc.com', 'gq.com', 'foxbusiness.com', 'dailydot.com', 'indiancountrynews.com', 'nationalenquirer.com', 'theroot.com', 'denverpost.com', 'hollywoodweeklymagazine.com', 'americaru.com', 'freep.com', 'nbcnews.com', 'theguardian.com'], 'filters': [], 'chunk': True}, provider_name='onlinenews-mediacloud', api_key=None, base_url='http://steinam.angwin:8200/v1/', caching=True)

Finer points:

  1. I regard timestamping as the job of the local logging.Handler's logging.Formatter, except perhaps if we make the file JSON encoded, in which case it needs to be put in by the sender.

  2. I'd previously expressed concern about the expense of calling parse_query twice, and here we see that it has constructed the domains list (requiring database queries)! And I think the full list is far larger, and far less valuable than the collection ids in terms of characterizing use.

Yeah, we could make sure the DB queries are cached, but it's a hard sell to me that this logging is clearly a job for middleware, and not something that could be in parse_query itself, which is JUST about always called. Calling parse_query from middleware feels to me like a layering violation, and the middleware a tail that's wagging the dog.

The next most important thought:

in mcweb/core/logging_middleware.py:

I'm imagining the caching is a result of your original ad-hoc "table with a single column".

I think the function normally used to retrieve "properties" should be decorated with mcweb.util.cache.cache_by_kwargs (which, despite the name, caches based on the function name and ALL the args!).

The decorated function might be nothing but a call to a function with "_" or "uncached_" in front of its name, called in the rare case of someone who HAS to have the latest answer.
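Something like this sketch, perhaps (the function and model names are illustrative, not the actual web-search code):

from mcweb.util.cache import cache_by_kwargs  # the decorator named above

def uncached_get_property(section, key):
    # Hypothetical raw lookup; call directly only when the latest value is a must.
    from core.models import ConfigProperty  # assumed model location
    row = ConfigProperty.objects.filter(section=section, key=key).first()
    return row.value if row else None

@cache_by_kwargs
def get_property(section, key):
    # Cached by function name plus args; this is what normal callers use.
    return uncached_get_property(section, key)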

Comments on the rest, from the top:

There are WAY too many whitespace changes for my taste, starting with mcweb/backend/search/models.py where it looks like the only change is two blank lines at the end of the file. I see a lot of the new files end this way, but frankly it's something I'd likely remove if I saw it out of the blue.

My opinion is that random formatting changes are disrespectful, and can easily devolve into pissing matches: Making them only invites the next person to make more (including changing things back to how THEY think things should be).

It's only natural to make them, even without thinking, so before I open a P.R. I look at the diffs, and try to remove as many extraneous changes as possible (it's easy to accumulate them in the development process). I realize this code evolved in the PR process...

mcweb/core/admin.py:

The names ConfigProperty and CoreConfig suggest to me that this is intended to be THE way to do config. I've known proponents of database based configuration, and it makes (more) sense to me in large distributed systems where there isn't an obvious single place for configuration, and it makes the ONE piece of configuration that must be set in an instance the database "DSN".

But I'm uneasy with that idea in our context: It's hidden state, making it easier for cases of "it works for me" to happen. (I'm reminded of a case recently where deployment of web-search worked for me, but not for Evan. The reason was that the git "remote" URLs in his repo weren't what my scripts expected (and perhaps my scripts had changed, and he had old state). It took me over a week to figure it out, including consulting with the primary Dokku developer (who I think of as Señor Dokku)!)

I know it's just a name, but names suggest things to the next person who walks up, and they think they know what the name means.

Fundamentally, it's a design decision that shouldn't be made by one person.

I used the word "property" in the rss-fetcher as intentionally neutral and nebulous!

I'm also of mixed feelings about a "property_type" column: it means that there are two different acts: defining a field, and setting one. I understand it makes the UI nicer, and likely more foolproof. But it suggests a model where the database NEEDS to be populated with stuff, which feels rigid to me. I'd ask Rahul for his call: he's likely going to live with it for longer than either of us!

[PB, on a second pass: I learned below that the definitions for the "configuration" are in the settings.py file (the normal location for configuration in Django apps, which I suppose is better than having them in a database migration file!)]

mcweb/settings.py:

REQUEST_LOG_PATH: changes should no longer be needed/used?

The added newlines in the LOGGING dict definition are IMO TOTALLY extraneous. I just edited these lines, and made no additions or removals of whitespace (I replaced explicit definitions with "# see below"). See my comments above, and think to yourself: would I be OK with the next person removing them?

Similarly, you added a hash to a "################" comment line I had put in as a delimiter??! I always put in the same number: 16 (by typing CTRL/U CTRL/U # in emacs; CTRL/U means do the next thing 4 times). I know 17 is revered in some circles (MIT Random Hall) as the most random number, but REALLY?

CONFIG_DEFAULTS: see above comments on use of the word "config" here. If typing is kept, my preference for it would be "PROPERTY_DEFINITIONS"

IMO: "obs" is WAY to short a name for a config section, and while "observability" is clearer, it feels nebulous, highfalutin, and ill defined to me. Monitoring feels more concrete, but I'm not sure how many variables will fall into either category, again, I'd say it's Rahul's call on what makes sense to him. I'm not a UI designer, and for good reason!

A quick thought I had had about sections would be to see if it made sense to put things together on the basis of who would need to have access to them to make access control simple, but I'm sure someone brighter than I could point out what kinds of messes are at the end of that road!

pgulley commented 2 days ago

simplified log output- it's now fully a dictionary output, and it's just the raw request params so there's no db call and no unwieldy list of sources

cleaned up extraneous line edits

moved caching behind a decorated function call

renamed config_defaults -> property definitions

Happy to tag @rahulbot in on this re: the naming of this module- but 'core' seemed like the lowest hanging fruit to me as it's an app-wide configuration detail.

I think typing belongs here, to keep us safe from user error- we can document around what the names imply, but we can't keep users (even ourselves) from making errors.

pgulley commented 1 day ago

Added request headers, per the Spanish Inquisition, and fixed some potential errors.

Log message now looks like this, ex: 2024-11-21 17:49:47,308 9fa2d43fca6c INFO: {"user": "nano3.14@gmail.com", "ip": "********", "method": "POST", "params": {"queryObject": {"query": "npr", "startDate": "10/18/2024", "endDate": "11/21/2024", "collections": [34412234], "sources": [], "platform": "onlinenews-mediacloud"}}, "duration": 0.24114656448364258, "request_time": 1732211387.0651803, "headers": {"Connection": "keep-alive", "Host": "pgulley-mcweb.ifill.angwin:8080", "X-Forwarded-For": "10.1.1.254", "X-Forwarded-Port": "80", "X-Forwarded-Proto": "http", "X-Request-Start": "1732211387.063", "Content-Length": "152", "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:132.0) Gecko/20100101 Firefox/132.0", "Accept": "*/*", "Accept-Language": "en-US,en;q=0.5", "Accept-Encoding": "gzip, deflate", "Referer": "http://pgulley-mcweb.ifill.angwin:8080/search?q=npr&nq=&start=10-18-2024&end=11-21-2024&p=onlinenews-mediacloud&ss=&cs=34412234&any=any&name=npr&edit=false", "Content-Type": "application/json", "X-Csrftoken": "g1hTltX8t3sNK33JfKDbihakN4a7syr1", "Origin": "http://pgulley-mcweb.ifill.angwin:8080", "Cookie": "csrftoken=********************; sessionid=q02rsbdvivk02t1a3ctt07toeydut0xs", "Sec-Gpc": "1", "Priority": "u=4"}}

I think maybe we should exclude some of those request values? Referer is redundant, and the csrf tokens are just noise.
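One easy way to do that (the header names here are just the obvious candidates): filter the headers dict before it's serialized.

# hypothetical filter applied before json.dumps()ing the log record:
EXCLUDED_HEADERS = {'Cookie', 'X-Csrftoken', 'Referer'}

def filtered_headers(headers: dict) -> dict:
    return {k: v for k, v in headers.items() if k not in EXCLUDED_HEADERS}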

philbudne commented 1 day ago

Referer should be redundant, but it's not SO large that keeping it for a bit is terrible.

It's POSSIBLE keeping all the headers could help debug a client problem (maybe make the enable setting have level numbers?). I'd hope (expect is too strong a word) that Django consults X-Forwarded-For headers when you ask for the client IP addr!

The original URL "path" (gives us the endpoint) is the thing I think is missing:

https://docs.djangoproject.com/en/5.1/ref/request-response/#django.http.HttpRequest.path

SEEMS like it should be what we need.

Also:

maybe consider turning off timestamping in the syslog.yml file (so it's pure JSON) and put a datetime.utcnow().isoformat() as the first thing in the dict, as a nod to humans grep'ing the file?

Next thing you know we'll be loading it into an ELK instance (almost entirely kidding!)

pgulley commented 1 day ago

Good suggestions all- 2024-11-21 20:09:47,511 04bacd83546b INFO: {"timestamp": "2024-11-21T20:09:47.511305", "user": "nano3.14@gmail.com", "ip": "172.17.0.1", "method": "POST", "path": "/api/search/count-over-time", "params": {"queryObject": {"query": "tea", "startDate": "10/18/2024", "endDate": "11/21/2024", "collections": [34412234], "sources": [], "platform": "onlinenews-mediacloud"}}, "duration": 0.08729314804077148, "headers": {"Connection": "keep-alive", "Host": "pgulley-mcweb.ifill.angwin:8080", "X-Forwarded-For": "10.1.1.254", "X-Forwarded-Port": "80", "X-Forwarded-Proto": "http", "X-Request-Start": "1732219787.420", "Content-Length": "152", "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:132.0) Gecko/20100101 Firefox/132.0", "Accept": "*/*", "Accept-Language": "en-US,en;q=0.5", "Accept-Encoding": "gzip, deflate", "Referer": "http://pgulley-mcweb.ifill.angwin:8080/search?q=tea%2520&nq=&start=10-18-2024&end=11-21-2024&p=onlinenews-mediacloud&ss=&cs=34412234&any=any&name=tea&edit=false", "Content-Type": "application/json", "Origin": "http://pgulley-mcweb.ifill.angwin:8080", "Sec-Gpc": "1", "Priority": "u=0"}}

The formatter change I added to remove the additional log context doesn't seem to work as documented, but I think this is okay?

philbudne commented 1 day ago

OOPS! I was wrong about the mod being needed in syslog.yml.proto; the place to change is mcweb/settings.py

add_syslog_handler needs an additional argument: either a "prefix" bool, or an _XXX_FORMATTER constant string...

pgulley commented 1 day ago

ahah, yes I see! working now: {"timestamp": "2024-11-21T21:12:31.000750", "user": "nano3.14@gmail.com", "ip": "172.17.0.1", "method": "POST", "path": "/api/search/sample", "params": {"queryObject": {"query": "tea", "startDate": "10/18/2024", "endDate": "11/21/2024", "collections": [34412234], "sources": [], "platform": "onlinenews-mediacloud"}}, "duration": 0.1697371006011963, "headers": {"Connection": "keep-alive", "Host": "pgulley-mcweb.ifill.angwin:8080", "X-Forwarded-For": "10.1.1.254", "X-Forwarded-Port": "80", "X-Forwarded-Proto": "http", "X-Request-Start": "1732223550.829", "Content-Length": "152", "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:132.0) Gecko/20100101 Firefox/132.0", "Accept": "*/*", "Accept-Language": "en-US,en;q=0.5", "Accept-Encoding": "gzip, deflate", "Referer": "http://pgulley-mcweb.ifill.angwin:8080/search", "Content-Type": "application/json", "Origin": "http://pgulley-mcweb.ifill.angwin:8080", "Sec-Gpc": "1", "Priority": "u=4"}}

rahulbot commented 22 hours ago

Responding to being tagged. Did a quick scan but there's a lot to take in here, and loads of discussion. From that scan this looks like it includes a property-management plugin for Django... but doesn't that exist already from some 3rd party? A not-thorough Google suggests django-constance and django-dynamic-preferences as potential options. I hesitate to roll our own for things like this that feel like others should have solved them.

If nothing is appropriate from someone else, for reference Wordpress calls a similar table wp_options, and names it "Settings" in their UI (it's a table of key/value pairs). Rails calls this config (at least it did a while ago). Maybe "AppSettings" or "AppProperties" here? (if I've understood the underlying addition being proposed here)

pgulley commented 5 hours ago

django-constance looks like it implements the same pattern I've mocked up here, but it's a full application with all the typing details sorted- so I'll switch over to that.

We get 'sections' slightly differently- but they still exist- and they have type-casting for the admin UI so we can keep things typesafe.
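For reference, a rough sketch of what that setup looks like in settings.py (the keys, defaults, and section names here are assumptions): django-constance takes (default, help text, type) tuples, and fieldsets provide the sections.

CONSTANCE_CONFIG = {
    # key: (default, help text, type) - the type drives admin-UI casting
    'REQUEST_LOGGING_ENABLED': (False, 'Toggle the request-log middleware', bool),
    'SYSTEM_MESSAGE': ('', 'Banner message shown to all users', str),
}

CONSTANCE_CONFIG_FIELDSETS = {
    'Observability': ('REQUEST_LOGGING_ENABLED',),
    'UI': ('SYSTEM_MESSAGE',),
}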

pgulley commented 2 hours ago

Django-constance is a breeze- I have it set up now doing the duty of the core app. I'll have to clean up the remnants of the core app and move the middleware itself somewhere more sensible, but the configuration behavior is now covered by the external dependency.