Closed: abarbosa94 closed this issue 3 years ago
If your remote task has an object that is not serializable, it cannot be pickled (for example, you cannot serialize thread locks). This normally happens when you capture the object like this:
lock = threading.Lock()

@ray.remote
def f():
    a = lock  # Here, lock is captured, and it is not serializable, so this function cannot be pickled to run remotely.
    # do something
Instead, you can do something like this:
@ray.remote
def f():
    lock = threading.Lock()
    a = lock  # Here, lock doesn't have to be serialized because it is created inside the task.
    # do something
I recommend checking out this document to see whether you are capturing any object that is not serializable: https://docs.ray.io/en/latest/serialization.html#troubleshooting
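To see the failure in isolation, here is a minimal stdlib sketch (it uses pickle directly; Ray uses cloudpickle under the hood, but locks fail the same way in both):

```python
import pickle
import threading

lock = threading.Lock()
try:
    pickle.dumps(lock)  # locks wrap OS-level state and cannot be serialized
except TypeError as exc:
    print(f"not serializable: {exc}")  # e.g. "cannot pickle '_thread.lock' object"
```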
Thanks a lot for the quick reply :)
By following the troubleshooting link, I was able to find out that HuggingFace datasets are not serializable.
I created an actor that stores the dataset as an attribute of the class. I'm able to perform retrieval in a distributed way successfully right now :)
I'll leave my code here in case someone faces this challenge in the future.
import ray
import torch
from datasets import load_dataset

# question_tokenizer, question_encoder, device, and the ARC_* paths are defined elsewhere.

@ray.remote(num_gpus=0.125)
class DPRRetrieval(object):
    def __init__(self):
        self.dpr_dataset = load_dataset(
            "text",
            data_files=ARC_CORPUS_TEXT,
            cache_dir=CACHE_DIR,
            split="train[:100%]",
        )
        self.dpr_dataset.load_faiss_index("embeddings", ARC_CORPUS_FAISS)
        torch.set_grad_enabled(False)

    def generate_context(self, example):
        question_text = example["question"]
        for option in example["options"]:
            question_with_option = question_text + " " + option["option_text"]
            tokenize_text = question_tokenizer(question_with_option, return_tensors="pt")
            tokenize_text.to(device)
            question_embed = question_encoder(**tokenize_text)[0][0].cpu().numpy()
            _, retrieved_examples = self.dpr_dataset.get_nearest_examples(
                "embeddings", question_embed, k=10
            )
            option["option_context"] = " ".join(retrieved_examples["text"]).strip()
        return example
obj_ids = [DPRRetrieval.remote() for _ in range(num_cpus)]
pool = ActorPool(obj_ids)
examples = [example for example in dataset['validation']]
parallel_result = pool.map(
    lambda a, v: a.generate_context.remote(v), examples
)
result = list(tqdm(parallel_result, total=len(examples)))
I'm able to scale a lot of the computation right now!
Closing this issue.
Thanks a lot!
What is the problem?
I'm using https://github.com/huggingface/datasets and I'm trying to apply the DPR model (https://huggingface.co/transformers/model_doc/dpr.html) to a problem I'm facing. For complexity reasons, I'm trying to use Ray for distributed retrieval, inspired by the idea from this blog post: https://medium.com/distributed-computing-with-ray/retrieval-augmented-generation-with-huggingface-transformers-and-ray-b09b56161b1e
I'm running model inference on GPU and dataset retrieval on CPU, but when I do this, I receive:
can't pickle SwigPyObject objects
Ray version and other system information (Python version, TensorFlow version, OS): 1.2.0
Reproduction (REQUIRED)
The relevant error from the log trace is the "can't pickle SwigPyObject objects" message shown above.
I guess this discussion can help you: https://github.com/huggingface/datasets/issues/1805
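When the traceback doesn't say which member of an object is the problem, a small stdlib helper can probe its attributes one by one (find_unpicklable is a hypothetical name, not a Ray or datasets API):

```python
import pickle
import threading

def find_unpicklable(obj):
    """Return the attribute names of obj whose values fail to pickle."""
    bad = []
    for name, value in vars(obj).items():
        try:
            pickle.dumps(value)
        except Exception:
            bad.append(name)
    return bad

class Holder:
    def __init__(self):
        self.data = [1, 2, 3]         # picklable
        self.lock = threading.Lock()  # not picklable

print(find_unpicklable(Holder()))  # ['lock']
```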