Could not reproduce paper result

zhujl3 commented 7 months ago

We used a LLaMA 7B model and built a 200B-token training set from the deduplicated EN-COMMONCRAWL dataset, using the concatenation method described in the paper. We also trained a baseline on the same 200B tokens with random concatenation. We found three inconsistencies with the paper.

1. At 200B training tokens, the paper reports a train loss of 2.0 for the baseline and 1.5 for ICLM, a 25% reduction. In our training at 200B tokens, however, the baseline loss is 2.1 and the ICLM loss is 2.0, a reduction of only about 5%. (Attached plots: loss_train, loss_paper.)

2. When testing on the race-high/race-middle datasets with 2-shot prompts after training on 200B tokens, the accuracy of both models appears to be below 0.25, which is significantly worse than the results reported in the paper. We also could not find an obvious difference on the other datasets used in the paper, such as qasper.
3. We also tried continued pretraining, but again could not see an obvious difference on any dataset except SST2.
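
As a quick check of the percentages in point 1, the relative reduction is (baseline - ICLM) / baseline. A minimal sketch; the function name is only illustrative, and the loss values are the ones quoted above:

```python
def relative_reduction(baseline: float, other: float) -> float:
    """Percentage by which `other` is lower than `baseline`."""
    return (baseline - other) / baseline * 100.0


# Train losses at 200B tokens, as quoted in point 1.
paper = relative_reduction(2.0, 1.5)  # paper: baseline 2.0, ICLM 1.5
ours = relative_reduction(2.1, 2.0)   # our run: baseline 2.1, ICLM 2.0

print(f"paper: {paper:.1f}%  ours: {ours:.1f}%")
```

This gives 25.0% for the paper and roughly 4.8% for our run.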

Could you give us some advice, or tell us how you evaluate the datasets with 2-shot prompts?

swj0419 commented 7 months ago

Thanks for your interest in our work.

(1) Did you train the model from scratch, or is this continued pretraining? (2) What race-high/middle scores did you get?

zhujl3 commented 7 months ago

(1) We tried both training from scratch and continued pretraining. The from-scratch version:

| Model | race-high | race-mid | boolq_en | race_high_2shot | race_middle_2shot | boolq_en_2_shot |
| --- | --- | --- | --- | --- | --- | --- |
| random-from-scratch-200B | 0.2 | 0.23 | 0.59 | 0.23 | 0.25 | 0.66 |
| iclm-from-scratch-200B | 0.2 | 0.22 | 0.62 | 0.17 | 0.13 | 0.64 |

The RACE scores look like random guessing.

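And the continued-pretraining version:

| Model | race-high | race-mid | boolq_en | race_high_2shot | race_middle_2shot | boolq_en_2_shot |
| --- | --- | --- | --- | --- | --- | --- |
| iclm-from-ct-100B | 0.76 | 0.83 | 0.64 | 0.75 | 0.82 | 0.78 |
| random-from-ct-100B | 0.76 | 0.81 | 0.61 | 0.74 | 0.84 | 0.78 |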

The zero-shot prompt: "race_high": "Answer the question based on the given passage. The answer is the choice of ABCD. Only give me the answer and do not output any other words. The following is the passage.\n\n{input}",

A 2-shot prompt: "race_high_2shot":"Please determine the type of the question below. Here are some examples of questions.\n\n{context}\n{input}",

data format:

"input": "Passage: For some reason, it takes constant reminders that weprimates need nurturing.
In a recent study of 46 babychimpanzee orphans, Kim Bard of the University of Portsmouth in England and her colleagues demonstrated that primate babies that have tight relationships with mother figures do much better oncognitive tests than babies who receive only food, shelter, and friendship with peers. But this is not breaking mews. In fact, it's old news.
In the 1950s, Harry Harlow conducted a series of experiments with baby monkeys that showed, without doubt, that lack of love and comfort makes for a crazy monkey.
Harlow constructed a cage that included a wire monkey \"mother\" topped with a plastic face. In this wire he fixed Mom with a milk bottle. The cage also held another wire mother covered with terry cloth. The baby monkeys spent all their time with the cloth mother and only went to the wire mother to feed, demonstrating that a soft touch beat something to eat any day.
Harlow's monkey work was important because, at the time, child care \"experts\" and everybody's grandmother had a \"no touch, no comfort\" policy toward children. They advised parents not to respond to crying babies, felt babies should sleep alone to grow up independent, and for God's sake put those kids down. But Harlow's work changed all that. Mothers were soon permitted to have their newborns next to them in the hospital.
The current chimp research based on Harlow's work shows that mother love not only makes for a psychologically well-adjusted child, but also makes for a smart kid. Bard and her colleagues evaluated the abilities of the chimps when they were 12 months old with standard human tests for children of that age, tests that ask little kids to imitate some action.
The highly raised chimps did better than the ones that were not loved, and what do you know, the well-raised chimps did even better than human kids on this small IQ test.
So we hear it once again. We are primates, social animals which need care and love. We need to be held and talked to and made to feel that at least one person wants to be with us all the time. And if we get that kind of connection, we are sure to be fine, even better than fine.
Question:Why was Harlow's monkey work important?
A.Because the \"no touch, no comfort\" policy toward children was quite right.
B.Because parents were advised not to respond to babies' crying.
C.Because Harlow's work changed people's former belief in child care.
D.Because mothers were not allowed to have their newborns next to them in the hospital.
Answer:
", 

"context": "Passage: According to Andrew, it never would have happened if he had not had a flat tire on Highway 10 last night at about 7:30. He was on his way to attend a three-day sales meeting when he had the flat. tyre. Unfortunately, he did not have a spare, so he pushed the car off the road, locked it up, and managed to thumb a ride back to Pine Grove. It was after eleven o'clock when he finally got home, and it was then that his real problems started.
When Andrew left home at about 5:30, he had told his wife not to expect him back until Thursday or Friday. Knowing that his wife was nervous about staying in the house alone at night, Andrew took the precaution of checking all the windows in the house to be sure they were locked, so that he could report to his wife that the house was secure. He convinced his wife that the house was burglar-proof, and that she would be perfectly safe, providing she bolted  the front door as soon as he drove away.
Andrew's only thought as he made his way in the dark to his front door was how surprised his wife was going to be to see him, since he was not supposed to be back until Thursday or Friday. He had forgotten about the bolt on the front door. When he turned his key in the lock and the door wouldn't _ he remembered the bolt. And he remembered that he had carefully locked all of the windows.
Although Andrew didn't know it at the time, a next-door neighbor had seen him approaching the house and had watched him go up the steps to the front door. In the dark, it was impossible for the neighbor to recognize Andrew, and, besides, the neighbor knew that Andrew had gone out-of-town for a three-day meeting. As a matter of fact, Andrew had asked the neighbor to keep an eye on the house while he was gone.
Finding that he couldn't get in, Andrew began pounding  on the front door to get his wife to open the door. According to Andrew, however, his wife is a very sound sleeper, and he knew it was going to be hard to wake her up. In the meantime, because of all the noise he had been making, the neighbor was convinced that somebody was trying to break into the house; so she called the police.
When we talked to Andrew at the country jail this morning, he said that he still didn't understand how the police managed to circle the house without his seeing them. He stated that he had decided the only way to get in was to break one of the dining room windows, and that he was about, to hurl his briefcase into the window to break it when two of the officers grabbed him from behind.
Andrew could not make the officers believe that he lived there; so they took him off to jail. Apparently, he did succeed in convincing them that they ought to wake up the woman in the house to check his story. But there was no answer when they knocked at the door. He tried to explain to them that his wife was a very sound sleeper, but they concluded there was nobody in the house.
Question:When Andrew was approaching the house  _  .
A.he was sure he would pleasently surprise his wife
B.he was deep in thought
C.he was sure that his neighbor would help him
D.he was worried about how to wake his wife up
Answer:
A

Passage: You may have heard of the book Moby Dick(<<>> ), written by the American author Herman Melville. You may also know that Moby Dick is considered one of the greatest novels ever written. However, it might surprise you to find out Herman Melville was not always a highly regarded author.
Melville's first two novels, Typee and Omoo, were widely read and financially successfully. They were both exciting tales of adventures at sea and experiences with people in foreign lands. Melville became quite famous. However, upon the publication of his third book, Mardi, Melville's popularity began to weaken. He was no longer interested in telling tales of pure adventure, and his writing took on a style that alienated  the general reading public of his time.
Melville published Moby Dick in October of 1851. it was an original novel, combining aspects of sociology and philosophy, which confused readers by its complex symbolism. The book sold poorly.
Melville's next book, Pierre, was almost completely disregarded by the public. Debt frustration and ill health finally forced Melville to take a low-paying job as a customs inspector. Eventually, Melville abandoned prose  and began to write poetry.
The Civil War is the main subject of Melville's poetry. He and his brother made a trip to the front line, and he published a book of poems, Battle-Pieces and Aspects of War, based on this experience.
Melville died in 1891 at the age of 72. at this point, his work had been completely forgotten by the public. His talent was to go unrecognized for the next thirty years. Then, in 1920s, his reputation began to improve as critics and readers rediscovered his work. Today Moby Dick is one of the best-known novels ever penned by an American author.
Question:What were Melville's first two novels mainly about?
A.His travel experience.
B.His successful communication skills.
C.Adventurous experiences in the front line.
D.Adventurous voyages and foreign experiences.
Answer:
D",

The from-scratch race-high/race-mid results are strange. Could the prompt format be the cause?
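
To make the prompt question easier to check, here is a minimal sketch of how the two templates above can be filled with the `context` and `input` fields. The `build_prompt` helper and the shortened sample are hypothetical stand-ins, not the actual evaluation code:

```python
# Templates copied from the eval config quoted above.
TEMPLATES = {
    "race_high": (
        "Answer the question based on the given passage. The answer is the "
        "choice of ABCD. Only give me the answer and do not output any other "
        "words. The following is the passage.\n\n{input}"
    ),
    "race_high_2shot": (
        "Please determine the type of the question below. Here are some "
        "examples of questions.\n\n{context}\n{input}"
    ),
}


def build_prompt(task: str, sample: dict) -> str:
    """Fill the template for `task` with one sample's fields (hypothetical helper)."""
    # str.format ignores unused keyword arguments, so the zero-shot
    # template simply skips the "context" field.
    return TEMPLATES[task].format(**sample)


# Shortened stand-in sample; real samples carry the full passages shown above.
sample = {
    "context": "Passage: ...\nQuestion: ...\nA. ...\nAnswer:\nA\n",
    "input": "Passage: ...\nQuestion: ...\nA. ...\nAnswer:\n",
}

prompt = build_prompt("race_high_2shot", sample)
print(prompt)
```

Printing the assembled prompt shows exactly what the model sees. Note that the 2-shot instruction text differs from the zero-shot one.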

swj0419 commented 7 months ago

We haven't explored the continued pretraining setup yet. However, the from-scratch results you've shared look quite unusual: for the four-choice classification in RACE, performance is nearly at random level. In contrast, our evaluation of BoolQ on 7B models reaches approximately 70% accuracy. For race-high, we directly compare the probabilities of each choice rather than generating one of the options A, B, C, D.
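
A rough sketch of what comparing choice probabilities can look like, as opposed to free generation. This is an illustration under stated assumptions, not the exact evaluation code: `logprob_fn` is a hypothetical callable that returns the model's log-probability of a continuation given the prompt, and the toy scorer below uses made-up numbers in place of a real model. Whether to score the option letter or the full answer text, and whether to length-normalize, are left open here.

```python
from typing import Callable, Sequence


def pick_choice(
    prompt: str,
    choices: Sequence[str],
    logprob_fn: Callable[[str, str], float],
) -> int:
    """Return the index of the choice with the highest log-probability."""
    scores = [logprob_fn(prompt, choice) for choice in choices]
    return max(range(len(scores)), key=lambda i: scores[i])


# Toy scorer standing in for a real model call (made-up numbers).
TOY_LOGPROBS = {"A": -2.3, "B": -1.9, "C": -0.4, "D": -3.0}


def toy_logprob(prompt: str, continuation: str) -> float:
    return TOY_LOGPROBS[continuation]


choices = ["A", "B", "C", "D"]
best = pick_choice("Passage: ...\nQuestion: ...\nAnswer:\n", choices, toy_logprob)
print(choices[best])  # prints C
```

With this kind of scoring the model never has to emit a well-formed answer on its own, so the prediction is always one of the four options.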

zhujl3 commented 7 months ago

Thank you for your reply! After discussion, we suspect that the most likely difference lies in the training data. Your paper mentions using the EN-COMMONCRAWL dataset but also cites CCNet, which suggests a data filtering pipeline.

My questions are: (1) Did you filter the EN-COMMONCRAWL dataset with the method used in CCNet? (2) Could you share the Common Crawl snapshot you used? The snapshot difference may also have an influence : )

swj0419 / in-context-pretraining

Could not reproduce paper result #5