neptune-ai / neptune-client

📘 The experiment tracker for foundation model training
https://neptune.ai
Apache License 2.0
574 stars · 63 forks

Huggingface Trainer closes run automatically after training #1663

Open Ulipenitz opened 6 months ago

Ulipenitz commented 6 months ago

Is your feature request related to a problem? Please describe.

When I use a Huggingface Trainer with a NeptuneCallback, the Trainer closes the run automatically after training and thus disconnects it from the Python logger. If I want to log anything to Neptune after training, I have to reinitialize the run, which complicates the code in bigger training pipelines.

Describe the solution you'd like

It would be great if the run persisted after training finishes.

Describe alternatives you've considered

My workaround looks like this:

main.py:

from dotenv import find_dotenv, load_dotenv
import logging
import os
import neptune
from neptune.integrations.python_logger import NeptuneHandler
from training_function import training_function

def setup_main_logger(run, run_id):
    logger = logging.getLogger()  # Get the root logger
    logger.setLevel(logging.INFO)
    formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    run, neptune_handler = get_neptune_handler(run, run_id, formatter)
    logger.addHandler(neptune_handler)
    return run, logging.getLogger(__name__)

def get_neptune_handler(run, run_id, formatter):
    # Stop the existing run (if it is still open), then reconnect to it
    # by id so a fresh NeptuneHandler can be attached.
    try:
        run.stop()
    finally:
        run = neptune.init_run(with_id=run_id, capture_stderr=True, capture_stdout=True)
    neptune_handler = NeptuneHandler(run=run)
    neptune_handler.setFormatter(formatter)
    return run, neptune_handler

if __name__ == "__main__":

    # load ENV variables
    load_dotenv(find_dotenv(), override=True)
    NEPTUNE_API_TOKEN = os.environ.get("NEPTUNE_API_TOKEN")
    NEPTUNE_PROJECT = os.environ.get("NEPTUNE_PROJECT")

    # Initialize Neptune run
    run = neptune.init_run(capture_stderr=True, capture_stdout=True)
    run_id = run["sys/id"].fetch()

    # Set up logging
    run, logger = setup_main_logger(run, run_id)
    ...
    logger.info("This logs perfectly to Neptune! ")
    training_function(..., run)
    logger.info("THIS NEVER GETS LOGGED TO NEPTUNE!")
    run, logger = setup_main_logger(run, run_id)
    logger.info("This logs perfectly to Neptune! ")

training_function.py:

from transformers.integrations import NeptuneCallback
from transformers import Trainer
import logging

logger = logging.getLogger()  # root logger

def training_function(..., run) -> None:
    ...
    # Create neptune callback for training logs
    neptune_callback = NeptuneCallback(
        run=run,
        log_parameters=True,
        log_checkpoints="all",
        )

    logger.info("This logs perfectly to Neptune! ")
    # Initialize the trainer using our model, training args & dataset, and train
    trainer = Trainer(
        model=model,
        args=args,
        ...
        callbacks=[neptune_callback],
    )
    logger.info("This logs perfectly to Neptune! ")
    trainer.train()
    logger.info("THIS NEVER GETS LOGGED TO NEPTUNE!")
SiddhantSadangi commented 6 months ago

Hey @Ulipenitz 👋 Neptune does indeed automatically stop the run once the training loop is done. However, we do provide multiple options to log additional metadata to the run once training is over. Here is our Transformers integration guide that lists these options 👉 https://docs.neptune.ai/integrations/transformers/#logging-additional-metadata-after-training

Please let me know if any of these work for you 🤗

Ulipenitz commented 6 months ago

Thanks for the answer @SiddhantSadangi! This is indeed useful for logging metadata like test metrics after training. My problem, though, is that I need to set up the Python logger again after the training function. I am training on a remote machine in the cloud, and unfortunately capture_stderr=True, capture_stdout=True only captures Neptune-specific logs, while I want all logs in Neptune, including those from the Python logger. My proposed workaround of calling setup_main_logger again works, but I don't think it is a nice solution.
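For what it's worth, the failure mode itself can be reproduced with the standard library alone: once a handler's backing sink is gone (here, the stopped run), records routed through it are silently dropped until a fresh handler is attached. A minimal sketch, using a hypothetical ListHandler as a stand-in for NeptuneHandler so no Neptune credentials are needed:

```python
import logging

class ListHandler(logging.Handler):
    """Stand-in for NeptuneHandler: appends formatted records to a list."""
    def __init__(self, sink):
        super().__init__()
        self.sink = sink
        self.sink_closed = False  # mimics the run being stopped

    def emit(self, record):
        if not self.sink_closed:
            self.sink.append(self.format(record))

logger = logging.getLogger("demo")
logger.setLevel(logging.INFO)

captured = []
handler = ListHandler(captured)
logger.addHandler(handler)

logger.info("before training")   # reaches the sink
handler.sink_closed = True       # what trainer.train() effectively does to the run
logger.info("after training")    # silently dropped

# Reattach a handler backed by a live sink, as setup_main_logger does
logger.removeHandler(handler)
logger.addHandler(ListHandler(captured))
logger.info("after reattach")    # reaches the sink again

print(captured)  # ['before training', 'after reattach']
```

The record emitted between "run stopped" and "handler reattached" is lost without any error, which is why the missing log lines after trainer.train() are easy to miss.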

SiddhantSadangi commented 6 months ago

Ah, understood! Yes, this is definitely inconvenient.

I think your workaround handles this pretty well in the absence of official support for this use case. I'll just suggest using neptune_callback's get_run() method to access the run used by the Transformers callback. This removes the need to store the run_id and reinitialize the run.

trainer = Trainer(
    ...
    callbacks=[neptune_callback],
)

logger.info("This will be logged to Neptune")

trainer.train()

logger.info("This won't be logged to Neptune")

run = neptune_callback.get_run(trainer)
neptune_handler = NeptuneHandler(run=run)
logger.addHandler(neptune_handler)
logger.info("This will be logged to Neptune")

Please let me know if this workaround works better for you 🙏

I will also pass this feedback to the product team ✍