jasminelmah / finalproject

project update! help! #5

Open jasminelmah opened 5 years ago

jasminelmah commented 5 years ago

Here's the overview of the plan:

There are very few papers that look into synteny in non-bilaterians. Those that do take a manual approach to synteny analysis, looking at only one or two 'targets' at a time. The paper I am modelling my approach on is [Ramos et al. 2012](https://doi.org/10.1016/j.cub.2012.08.023), the original 'ghost locus' paper that examined sponges and placozoans for synteny.

The paper starts with a list of neighbour genes known to be syntenic with the gene of interest (GOI) in humans. The authors then classified every gene in their genomes of interest as orthologous to a neighbour gene, orthologous to a non-neighbour gene, or species-specific. Then, to identify significant clustering, they used the exact binomial test to ask whether the observed number of neighbour gene orthologues co-localizing to a scaffold is significantly higher than the expected number.
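
To make the binomial step concrete for myself, here is a minimal sketch of a one-sided exact binomial test. It is not the Ramos et al. code (their scripts are unavailable). The function names and the choice of null probability (the fraction of all genes in the genome that are neighbour gene orthologues) are my own assumptions, and the paper may define the expectation differently.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def scaffold_clustering_p(neighbours_on_scaffold, genes_on_scaffold,
                          neighbours_in_genome, genes_in_genome):
    """One-sided exact binomial p-value for one scaffold.

    ASSUMPTION: under the null, each gene on the scaffold is a neighbour
    gene orthologue with probability neighbours_in_genome / genes_in_genome.
    """
    p_null = neighbours_in_genome / genes_in_genome
    return binomial_upper_tail(neighbours_on_scaffold, genes_on_scaffold, p_null)


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    print(scaffold_clustering_p(4, 20, 50, 20000))
```

A small p-value here would mean the scaffold carries more neighbour gene orthologues than expected by chance under that assumed null. With many scaffolds tested, some multiple-testing correction would also be needed.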

However, for Amphimedon they also (or instead?) ran a Monte Carlo simulation to generate the null distribution of neighbour genes in the absence of synteny. I'm not completely sure why they also did this MC, but perhaps it is because the Amphimedon scaffolds are sub-chromosomal? The p-value for a test of clustering is calculated as the proportion of simulations in which the number of scaffolds occupied by neighbour genes is less than or equal to the number actually observed. This is described in the Ramos supplement, and most of it makes sense to me, but not everything. For instance, they say that the results are stored in an "amphisimulation" relational database, but what is that?! Google only brings up that paper and 3 random websites.
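
As I understand the supplement, the Monte Carlo p-value could be sketched roughly as below. This is my own guess at the procedure, not the Ramos et al. implementation. The function names are hypothetical, and the null model (neighbour genes placed uniformly at random among all gene positions, without replacement) is an assumption.

```python
import random


def simulate_scaffolds_occupied(scaffold_sizes, n_neighbours, rng):
    """Place n_neighbours genes at random gene positions and count how many
    distinct scaffolds they land on. scaffold_sizes[i] is the gene count of
    scaffold i."""
    # One slot per gene position, labelled with its scaffold index.
    slots = [idx for idx, size in enumerate(scaffold_sizes) for _ in range(size)]
    chosen = rng.sample(slots, n_neighbours)  # sampling without replacement
    return len(set(chosen))


def clustering_p_value(scaffold_sizes, n_neighbours, observed_occupied,
                       n_sims=10000, seed=0):
    """Proportion of simulations in which the neighbour genes occupy as few
    scaffolds as observed, or fewer."""
    rng = random.Random(seed)
    hits = sum(
        simulate_scaffolds_occupied(scaffold_sizes, n_neighbours, rng)
        <= observed_occupied
        for _ in range(n_sims)
    )
    return hits / n_sims


if __name__ == "__main__":
    # Hypothetical genome: 200 scaffolds of 30 genes each, 12 neighbour
    # gene orthologues observed on only 5 scaffolds.
    print(clustering_p_value([30] * 200, 12, 5))
```

Fewer occupied scaffolds means tighter clustering, which is why the comparison is "less than or equal to". Unlike the binomial test, this accounts for the actual scaffold size distribution, which may be why it was used for a fragmented assembly.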

Sub-issues:

  1. I have definitely bitten off more than I can chew. Given how manual this process is, I will definitely not be getting through the 67 neural genes I've found.
  2. Ramos supplied a link to the scripts, but the link no longer works. Reimplementing them will require coding, which can take a lot of time.
  3. These papers relied on a list of neighbour genes that have been previously identified, but none of my genes of interest have had any synteny analysis done on them. Just doing this would probably be a project in itself.

Towards issue 3: I have been working to get DAGchainer running on Farnam. I chose DAGchainer partly because it doesn't use outdated, unavailable file types, but also because, relative to other popular programs, it should function acceptably on fragmented genomes. From my reading, it sounds like essentially no standard synteny program will work on genomes as fragmented as non-bilaterian genomes. But I can use DAGchainer to identify neighbour genes in model organisms first, then go through the more manual procedure outlined in Ramos et al. to search for synteny in non-bilaterian genomes.

DAGchainer seems to be installed, BUT the scripts are ten years old and were run on an old version of g++ that so far I haven't been able to get working on Farnam. Switching out the outdated library names for their new libraries allowed the program to run without error, although some warnings did come up. The test data also ran without error, but one output file isn't identical to what was expected. I am still trying to sort this out!

Questions for Casey

Do you have any advice on how I might be able to make this project more realistic in the time frame I have? If I had to do a portion of this project, what part would you most like to see?

Thanks for your help!

jasminelmah commented 5 years ago

Alternatively, I can try out Warren Francis' microsynteny script [here](https://bitbucket.org/wrf/sequences/). It is designed to work on highly fragmented genomes with thousands of scaffolds.