jbloom / Huanan_market_samples_addtl_analysis

additional analysis of metagenomic samples from the Huanan Seafood Market
MIT License

CCoV HeB-G1 and SD-F3 were sampled in raccoon dogs #1

Open zach-hensel opened 7 months ago

zach-hensel commented 7 months ago

In fact, for three of these four animal coronaviruses, the strongest correlation between the number of viral and animal reads is for the animal species known to be infected with the virus...

The animal species known to be infected by HeB-G1 and SD-F3 canine coronaviruses is the raccoon dog. This is an interesting paper for some context on these and related viruses in the animal trade -- https://www.sciencedirect.com/science/article/pii/S2095927323006278 -- SD-F3 is associated with fecal samples from 14 sick raccoon dogs and HeB-G1 from one dead raccoon dog.

image

From Table S1 in https://www.sciencedirect.com/science/article/pii/S0092867422001945

jbloom commented 7 months ago

There is a whole family of related canine alphacoronaviruses, closely related members of which have been found in various canids (a taxonomic family that is colloquially known as "dogs"), including pet dogs, foxes, raccoon dogs, etc.

The exact canine coronavirus(es) found in the Huanan market samples do not appear to be exactly identical to any known reference genome, but fall in this family of coronaviruses infecting dogs.

I aligned reads to two representatives of this family, canine CoVs HeB-G1 and SD-F3, because those were the two canine coronaviruses used as references in Crits-Christoph et al (2023), and I thought it was simplest to use the same reference genomes.

However, if you try mapping to additional genomes, you will see that the reads actually map to multiple different closely related canine alphacoronaviruses, probably because the true virus(es) in the market aren't exactly identical to any of these references.

In fact, there are more reads that map uniquely to two other canine coronaviruses not used by Crits-Christoph et al (2023), namely canine CoVs B363 (MT114541.1) and B447 (MT114540.1). So probably the true virus that these reads are derived from is a bit more closely related to those two canine coronaviruses as opposed to the canine CoVs HeB-G1 and SD-F3 used as references in Crits-Christoph et al (2023) and then in my paper.

While it is true that canine CoVs HeB-G1 and SD-F3 were isolated from raccoon dogs, canine CoVs B363 and B447 were from pet dogs. So basically, the Huanan market samples contain canine CoV(s) closely related to a whole family of other canine CoVs, with probably the closest relationship to ones isolated from pet dogs but also close relationships to ones from raccoon dogs. The reads from that canine CoV from Jan-12-2020 are found mostly in samples (really mostly one sample) that have most of their animal metagenomic content from a dog (Canis lupus familiaris).

If you wanted to, you could try to de novo assemble the genome of this CoV, and probably the best scaffold to use would be the canine CoV B363.

But for the point of my paper, I think the presence of these reads should pretty clearly demonstrate that in fact there are some CoVs at high abundance in the Jan-12-2020 samples, which was sort of the main point. These include a canine CoV that has most of its reads found in samples containing a lot of dog genetic material.

zach-hensel commented 7 months ago

I would not be surprised if dogs and/or raccoon dogs were infected with the related virus. Neither would you. So the issue is plainly stating that a dog was the host, while highlighting dog in figures and not noting that raccoon dog is also likely susceptible and is detected at high rates in the same samples as these viruses, yet is more correlated with a bamboo rat coronavirus. The number of species-positive and virus-positive samples and the median number of species-positive reads in virus-positive samples do not seem to support one species over the other.

For four of the animal coronaviruses (bamboo rat CoV, canine CoV HeB-G1, rabbit CoV HKU14, and canine CoV SD-F3), the samples that have a substantial number of viral reads also have a high content of genetic material from the animal known to be infected by that virus

In fact, for three of these four animal coronaviruses, the strongest correlation between the number of viral and animal reads is for the animal species known to be infected with the virus

the sample with the most viral reads has the largest fraction of its mitochondrial genetic material from the animal known to be infected with that virus

the panel corresponding to the animal species that is known to be infected by that virus is shaded

an association between material from the virus and the animal it is known to infect

(elsewhere) For 4 most abundant animal CoVs in these samples there is association of viral & host animal content

I agree CCoV was found at somewhat higher abundance on 12/January both for number of reads and number of samples. I do not think that all CoVs is an apples-to-apples comparison. B363 and B447 were likewise sampled from animals with viral diarrhea. Disproportionate detection on a machine for skinning animals is not surprising for this type of canine coronavirus but seems less expected for SARS2. Here is SARS2 abundance in different samples from infected raccoon dogs:

image

Fig S4 from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7706974

A single sample shifts Pearson correlation for dog/raccoon dog from 0.51/-0.02 to 0.24/0.46 for HeB-G1.
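The sensitivity described here can be checked with a leave-one-out pass over the samples. The following is a minimal sketch, not the code used in either analysis; the read counts are invented for illustration and `pearson` and `leave_one_out` are hypothetical helper names.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant series
    return cov / math.sqrt(var_x * var_y)


def leave_one_out(xs, ys):
    """Return (dropped_index, r_without_that_sample) for every sample."""
    results = []
    for i in range(len(xs)):
        r = pearson(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        results.append((i, r))
    return results


# Hypothetical per-sample read counts (NOT the market data).
viral_reads = [0, 1, 2, 0, 3, 500]
animal_reads = [10, 5, 8, 12, 4, 900]

full_r = pearson(viral_reads, animal_reads)
loo = leave_one_out(viral_reads, animal_reads)

# The sample whose removal changes r the most.
most_influential = max(loo, key=lambda item: abs(item[1] - full_r))

print(f"r with all samples: {full_r:.2f}")
for index, r in loo:
    print(f"r without sample {index}: {r:.2f}")
```

In this toy data a single high-count sample carries the whole correlation, so dropping it flips the sign of r, while dropping any other sample barely changes it.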

jbloom commented 7 months ago

I am happy to re-map the reads to B363 and B447 and figure out if there are appreciable reads that map better to HeB-G1 and SD-F3 or if it indeed appears that B363 and B447 are better reference genomes for basically all reads. My quick-and-dirty analyses suggest to me that B363 and B447 are closer to the sequence of the source of the canine CoV reads than HeB-G1 or SD-F3, although all references are quite close but not quite identical to the actual virus(es) that are the source of these reads.

Do you plan to do this comparison for your Crits-Christoph et al (2023) pre-print which is where I am currently drawing the set of reference genomes to use (Table S17 of your pre-print)? If you plan to do it there then I will wait until that is done and then just repeat the above analysis with your updated reference genome set.

zach-hensel commented 7 months ago

I agree with you that it's uncertain if dogs and/or raccoon dogs were infected with canine coronavirus(es). I agree RefSeq supplemented with He et al and Cui et al genomes doesn't encompass all published genomic data; the methods in Crits-Christoph et al are clear in this respect.

The issue is concluding that "material from some of these animal coronaviruses is associated with the animals they probably infect" without solid evidence that dogs are the most probable source. Raccoon dogs are not mentioned as a possible host, "associated" is used interchangeably with "correlated", and raccoon dog reads are more correlated with bamboo rat coronavirus reads than with either CCoV. However, dog being more correlated with CCoV than with a bamboo rat coronavirus hinges on a single sample, and even then only on a linear scale. In general, relatively high correlation is observed for the same species regardless of the virus for the four examples considered. Here I highlight everything over 0.20. I don't see how observed correlation is correlated with host species as much as it is with something else.

image
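The dependence on scale mentioned above can be illustrated by computing the same correlation on raw counts and on log-transformed counts. This is only a sketch: the counts are invented, and the `log10(count + 1)` transform is an assumption, since the thread does not say which transform was used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def log_scale(counts):
    """log10(count + 1); the +1 pseudocount is an assumption to handle zeros."""
    return [math.log10(c + 1) for c in counts]


# Hypothetical per-sample read counts (NOT the market data).
viral_reads = [0, 0, 1, 2, 5, 2000]
animal_reads = [300, 40, 100, 60, 200, 1500]

r_linear = pearson(viral_reads, animal_reads)
r_log = pearson(log_scale(viral_reads), log_scale(animal_reads))

print(f"linear scale: r = {r_linear:.2f}")
print(f"log scale:    r = {r_log:.2f}")
```

On a linear scale the one very large sample dominates both the covariance and the variances, so r is close to 1; the log transform compresses that sample and the remaining samples contribute more to the result.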

So where is it coming from? The strongest correlation reported for Fig 3A and 3B alike is for bamboo rat with bamboo rat coronavirus. Coloring by stall, you can see that correlation largely arises from both species and virus being disproportionately detected, particularly in one stall. This is the same thing we reported for SARS2, noting that mtDNA counts "were significantly correlated with SARS-CoV-2 [for two species] reflecting their increased detection in wildlife stall A."

image
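One way to see whether a correlation reflects co-location in a stall is to tabulate detections per stall before pooling. The sketch below uses invented records and hypothetical field names (`stall`, `species_reads`, `virus_reads`); the one-read detection threshold is also an assumption.

```python
from collections import defaultdict

# Hypothetical per-sample records (NOT the market data).
samples = [
    {"stall": "A", "species_reads": 250, "virus_reads": 40},
    {"stall": "A", "species_reads": 90, "virus_reads": 12},
    {"stall": "A", "species_reads": 0, "virus_reads": 0},
    {"stall": "B", "species_reads": 5, "virus_reads": 0},
    {"stall": "B", "species_reads": 0, "virus_reads": 0},
    {"stall": "C", "species_reads": 0, "virus_reads": 1},
    {"stall": "C", "species_reads": 0, "virus_reads": 0},
]


def detection_by_stall(records, min_reads=1):
    """Count samples per stall that are positive for species, virus, or both."""
    summary = defaultdict(lambda: {"n": 0, "species": 0, "virus": 0, "both": 0})
    for rec in records:
        row = summary[rec["stall"]]
        has_species = rec["species_reads"] >= min_reads
        has_virus = rec["virus_reads"] >= min_reads
        row["n"] += 1
        row["species"] += int(has_species)
        row["virus"] += int(has_virus)
        row["both"] += int(has_species and has_virus)
    return dict(summary)


by_stall = detection_by_stall(samples)
for stall, row in sorted(by_stall.items()):
    print(stall, row)
```

If most double-positive samples fall in one stall, as in this toy data, a pooled correlation between species and virus reads mostly restates where the samples were taken.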

The stall driving the bamboo rat correlation happens to be the same stall in which 5 of 6 samples with SARS2 reads were identified on 12/Jan/2020, 3 also positive by qPCR. It's perplexing why it's important to compare the relative fraction of viral and mtDNA reads in samples in Huanan market, yet sampling location is somehow "outside the scope" in an analysis written in response to a report that analyzes the spatial distribution of SARS2-positive samples.

Defining a limited scope of mapped genomes and focusing analysis on those with the highest coverage seems reasonable. Mapping without high coverage data to a broader database including partial genomes would add ambiguity without answering additional questions. Briefly checking some reads and looking at your counts for SD-F3 and HeB-G1, I suppose what's represented in these samples is a recombinant CCoV genome not present in the Crits-Christoph et al set (resulting in correlated counts), and at least one additional genome.

image

Are there inaccurate claims in Crits-Christoph et al regarding host specificity? The text "we identified close relatives of viruses reported to infect the wildlife species also detected in these samples" and Fig S5 legend: "Canine CoVs (reported in raccoon dogs" are accurate. Other text relates to three viruses that are the subject of Fig 4.

jbloom commented 7 months ago

Thanks for the thoughtful comments. Just acknowledging them. I am traveling this week, but will get back to them early next week. Sorry in advance for the delay.

jbloom commented 7 months ago

I did some further analysis (see plot immediately below). Basically, the canine CoV reads map to all of canine CoV B363, B447, HeB-G1, and SD-F3. The blue bars show reads mapping at a reasonable identity threshold to any of these, and the orange bars show the reads mapping with better quality to one genome than the others. The code implementing this is on this branch of the repo.

image
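The two kinds of counts in the plot can be sketched as follows. This is an illustrative reading of the description above, not the code on the branch: the alignment records are invented, the 0.95 identity cutoff is an assumption, and "better quality" is interpreted here as a strictly highest alignment score.

```python
REFERENCES = ["B363", "B447", "HeB-G1", "SD-F3"]

# Hypothetical alignments: read -> {reference: (identity, alignment_score)}.
alignments = {
    "read1": {"B363": (0.99, 148), "B447": (0.98, 145), "HeB-G1": (0.96, 140)},
    "read2": {"B363": (0.97, 142), "B447": (0.97, 142)},  # tie: no unique best
    "read3": {"HeB-G1": (0.99, 149), "SD-F3": (0.95, 138)},
    "read4": {"SD-F3": (0.90, 120)},  # below the identity cutoff
}


def summarize(read_hits, min_identity=0.95):
    """Return (mapped, best) read counts per reference.

    mapped[ref]: reads aligning to ref at or above min_identity.
    best[ref]:   reads whose highest score is to ref alone (ties are skipped).
    """
    mapped = {ref: 0 for ref in REFERENCES}
    best = {ref: 0 for ref in REFERENCES}
    for hits in read_hits.values():
        passing = {
            ref: score
            for ref, (identity, score) in hits.items()
            if identity >= min_identity
        }
        for ref in passing:
            mapped[ref] += 1
        if not passing:
            continue
        top = max(passing.values())
        winners = [ref for ref, score in passing.items() if score == top]
        if len(winners) == 1:
            best[winners[0]] += 1
    return mapped, best


mapped, best = summarize(alignments)
print("mapped at threshold:", mapped)
print("uniquely best:      ", best)
```

Reads that tie between two closely related references count toward both in `mapped` but toward neither in `best`, which is why the second set of counts can be much smaller than the first.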

Overall, what I would conclude from this is that there are one or more canine CoVs in the sample that aren't exactly identical to any of these references, but are closest to B363 and B447 rather than HeB-G1 and SD-F3, although substantially more analysis would be needed to see if it's possible to reconstruct the exact sequence of those canine CoVs.

So overall, at this level of analysis, it is still accurate to say that the canine CoV reads reasonably come from a virus known to infect dogs, since the most reads map to two viruses isolated from dogs. It is also true that these reads map well to closely related canine CoVs isolated from raccoon dogs. Since dogs and raccoon dogs are both canids, it is quite likely that some viruses may even be able to infect both species.

A next-level analysis would really use these scaffolds to try to assemble the actual canine coronaviruses in the samples. But for the case of my analysis, where it clearly says it is just aligning to the genomes defined in Crits-Christoph et al (2023), it is sort of far beyond the current scope to actually try to assemble new CoV genomes rather than just using the ones in that study. If you want to augment that study by trying to assemble the actual CoV genomes in the samples, I will re-align to them, but otherwise I prefer to just keep using the same reference set from your study for consistency.

More broadly, it is probably true that some of these coronaviruses can infect multiple species; we certainly know that is the case for SARS-CoV-2.

But none of this really alters the conclusions of my study which are:

zach-hensel commented 7 months ago

This is circling back around to none of the metagenomics data being informative. But I think it is informative. For example, yes, the bamboo rat correlation with bamboo rat CoV has a lot to do with two stalls, but it also likely has a lot to do with infected bamboo rats, absent identification of other susceptible species. Relative counts in the same sample can also be informative; I’d guess dog rather than raccoon dog from the highest-count sample if that were the only sample available.

It’s worth a thought experiment as to whether these data would be informative if these were observed:

Faced with imperfect data of those types, one should consider the sum of the data even if a reason can be found to say any piece is not proof on its own.

You perhaps know that there are reads covering a lineage A defining mutation in a sample other than A20. We didn’t mention this in the paper and I think this demonstrates some caution in drawing conclusions from available information.

jbloom commented 7 months ago

I haven't tried to build any of the SARS-CoV-2 genomes from the market. Is there good coverage for a lineage A sequence in some of the samples?

zach-hensel commented 7 months ago

Definitely not good coverage. Not much for A20 before amplicon sequencing. F46 is the sample with a couple identical reads across C8782T. Very few have coverage across 8782 or 28144 and none with much other than F13 and F54.

From memory there are a handful of other interesting very low coverage mutations. One encoding K417N. A couple show up in uncertain ways elsewhere in China but I didn’t find anything definitively not noise.